Fast computer data segmenting techniques

ABSTRACT

Versions of the invention are directed to computer-based methods, apparatus and software (programs) for fast, dynamic programming and recursive partitioning techniques to segment data, especially real-world data, into data structures for display as nodal trees. These techniques and displayed data in segmented form have numerous applications, especially for the analysis and understanding of real-world data. Some particular applications are in the area of computational high throughput screening of molecular drug (or pharmaceutical) candidates using a quantitative structure activity relationship (QSAR) approach. Another particular application is in the areas of pharmacogenomics and pharmacogenetics.

[0001] The present patent application claims priority from U.S. provisional patent application No. 60/225113, filed Aug. 14, 2000, and all of the contents of U.S. provisional patent application No. 60/225113 are incorporated herein by reference and to the fullest extent of the law. The present application is a CIP of PCT/US01/25519 (having the same title) filed Aug. 14, 2001, and PCT/US01/25519 is incorporated herein by reference in its entirety and to the fullest extent of the law. The present application claims priority from U.S. provisional patent application No. 60/358631, filed Feb. 20, 2002, and all of the contents of U.S. provisional patent application No. 60/358631 are incorporated herein by reference and to the fullest extent of the law.

TECHNICAL FIELD

[0002] Versions of the invention are in the field of computer-based methods and techniques for segmenting data into homogeneous segments (similar subgroups). Such data includes, for example, real-world data that represents real-world objects and phenomena. Applications in numerous fields exist (for example, see below). Versions of the invention are generally in areas that are often referred to as recursive partitioning, data mining, data processing or cluster analysis.

[0003] Some versions are specifically in the areas of computational chemistry, pharmaceutical high throughput screening and genetics. Some such versions of the invention segment molecules such as drug candidate molecules into homogeneous segments, wherein each homogeneous segment is essentially a subgroup of drug candidates having a similar property and similar characteristics (or descriptor values). Some versions of the invention display data in segmented form (on a monitor or equivalent device) for practical use by a human operator. One such practical use is for research and development purposes in the pharmaceutical industry. Some versions of the invention display data in segmented form for purposes of research and development.

BACKGROUND

[0004] Computer-based Segmenting Algorithms

[0005] The use of computer-based segmenting algorithms to segment a group of sequential data into like parts (similar subgroups) is a known technique.^(I) Such segmenting algorithms collect data values into similar subgroups, wherein each subgroup corresponds or belongs to a segment. These algorithms and methods essentially “segment” the data, so that data (or data values) within each segment are essentially homogeneous (see FIG. 5 in the Appendix as an example). A measure of the homogeneity of the data in each segment is frequently calculated. An overall measure (for all the segments combined) of the homogeneity of the data (or data values) in the segments is also frequently calculated. An important advantage of these segmenting algorithms is for correlation purposes. (Ref 1 endnotes, page 390)

[0006] Data or data points in such a segmented form are often easier to work with and easier to understand. For this reason, computer-based processes that “segment” such data, as well as data in segmented form, have great utility. Applications of such data segmenting processes, as well as data in segmented form, occur in a multitude of fields. Even in the field of geology there are many such applications to geological data; these include mechanical logs of bore holes, x-ray data, seismic traces, magnetic profiles, and land-resource observations made along transects. (See reference 1 endnotes, p. 390.)

[0007] A dynamic programming (DP) segmenting algorithm was developed by Hawkins. This Hawkins DP algorithm finds one or more essentially optimal data segmentations or “coverings” by essentially calculating an overall measure of segment homogeneity for each possible segmentation (or covering).^(II) One or more coverings with the optimal value of overall homogeneity are then selected by the algorithm. This DP algorithm was an improvement, in terms of running time, over non-DP approaches. (See Reference 1, pp. 390-391, and the Description section for more details.)
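As an illustration only, the following minimal Python sketch shows the general shape of a dynamic programming search for an optimal covering of ordered data. It is not the Hawkins code. It assumes a least-squares measure of homogeneity (the within-segment sum of squared deviations from the segment mean) and a fixed number of segments k; the function names `segment_cost` and `optimal_segmentation` are hypothetical.

```python
from typing import List, Tuple


def segment_cost(prefix: List[float], prefix_sq: List[float], i: int, j: int) -> float:
    """Sum of squared deviations from the mean for y[i:j] (j exclusive)."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def optimal_segmentation(y: List[float], k: int) -> Tuple[float, List[int]]:
    """Return (overall cost, cutpoints) of the best split of y into k contiguous segments."""
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(y)")

    # Running sums let each segment cost be evaluated in constant time.
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for idx, v in enumerate(y):
        prefix[idx + 1] = prefix[idx] + v
        prefix_sq[idx + 1] = prefix_sq[idx] + v * v

    inf = float("inf")
    # best[m][j]: minimal cost of covering y[0:j] with exactly m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):  # i is the start of the last segment
                c = best[m - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment starts.
    starts = []
    j = n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    cuts = sorted(starts[:-1])  # the first segment always starts at 0
    return best[k][n], cuts
```

In this sketch every admissible position of every cutpoint is scored, so the work grows roughly with k times the square of the number of data points, which is the kind of cost that becomes burdensome for large data sets.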

[0008] Recursive Segmenting, Methods of Recursive Partitioning

[0009] Segmenting techniques have continued to evolve. For example, one or more segmenting algorithms have frequently been used to segment data recursively (or repeatedly). Such recursive techniques result in a recursive partitioning (RP) of data into subgroups. One known computer-based scheme that uses a combination of segmenting algorithms and RP techniques is FIRM. FIRM stands for Formal Inference-based Recursive Modeling. FIRM was developed by Professor Hawkins and is publicly available (see Description section for more details).

[0010] Conventional Segmenting Techniques Limited by Long Computer Running Times

[0011] Despite continued evolution of segmenting techniques, these techniques continue to have a major limitation. This major limitation of conventional segmenting algorithms is that they frequently work slowly with large amounts of data or large numbers of data points. The Hawkins DP algorithm also has this limitation.

[0012] The long running times of conventional segmenting algorithms are a significant problem for many potential applied fields of usage of segmenting techniques. This significant problem exists in the areas of computational chemistry, high-throughput screening of pharmaceuticals and genetics analysis, where the amount of data to be segmented is enormous.

[0013] The Great Need for Better High Throughput Screening of Pharmaceuticals

[0014] A veritable explosion in the number of compounds available as potential pharmaceuticals has recently taken place. Large numbers of different types of compounds are being physically tested for biological, medical and pharmaceutical properties. And a vast amount of information or data on both tested and untested compounds is being accumulated. Such data is being stored in large chemical libraries. Such libraries have both general and specific (focused) data on chemical compounds that are potential pharmaceuticals.

[0015] In addition, the number of potential pharmaceuticals will be greatly increased by the Human Genome Project. This project will identify numerous new “drug targets”. These targets are places at the molecular level for a drug to act or exert its effect. Such an increase in drug targets will also greatly increase the number of potential pharmaceutical compounds.

[0016] Research and development to find new and useful pharmaceuticals has usually required sifting through large numbers of candidate compounds in order to find promising candidates. One method of screening candidate compounds is to physically test the candidate compounds. In its simplest form, screening by physical testing is essentially “trial and error” and requires testing essentially every candidate. Even more sophisticated physical testing procedures require a great deal of effort, time and expense.

[0017] Current methods of screening large numbers of candidates are known as high throughput screening (HTS). Significant advances in the technology for the testing of compounds for desirable pharmaceutical properties have occurred, yet HTS still has great deficiencies.

[0018] Current HTS techniques simply cannot screen the number of newly available potential candidate pharmaceuticals. Limitations in current HTS methods cause delays in bringing drugs to market, resulting in great losses in potential profits. And many large-scale high throughput screening attempts still fail to identify a good lead compound (prototype drug molecule) to stimulate further research.

[0019] Computer-based methods of screening pharmaceutical candidates have the potential to save expense, time and work in high throughput screening.

[0020] Computer-based methods of screening molecules (or compounds) are methods of reducing the workload, time and expense of screening by physical testing. Such computational approaches attempt to identify promising candidate compounds (or molecules) with desirable pharmaceutical properties.

[0021] For example, a certain group of compounds may be known to possess a desirable pharmaceutical property. A computer or human judgment then identifies molecular or chemical characteristics of the compounds in this group. A computer-based identification of other compounds that have the same (or similar) molecular characteristics is then done to form a new group of promising candidate pharmaceutical compounds. The candidate compounds (or molecules) in this new group have an increased probability of possessing the desired property, despite not having been physically tested.

[0022] Thus, a promising new group of candidate pharmaceuticals has been identified without the actual physical testing of the compounds in the group. And much work, time and expense have been saved. The compounds in the group can then be subjected to further investigation.

[0023] Computational HTS using QSAR

[0024] Most important computational screening approaches are based on the idea that a particular pharmaceutical property of a compound is due to the compound's molecular structure. In effect these approaches assume that the property is due to the compound's shape at the molecular level. Such “quantitative structure-activity relationship” or QSAR approaches attempt to characterize the parts of a molecule's shape that contribute to the pharmaceutical property or “activity”. Such important molecular parts (pieces of a molecule) are sometimes referred to as pharmacophores. Just as keys fit into a lock, molecular parts such as pharmacophores of the right shape cause their effects by fitting into other “target molecules” in the human body. (These target molecules are sometimes called receptors.) In effect, QSAR approaches are similar to looking for “molecular puzzle pieces”: pharmacophores or molecular parts having about the same molecular shape or characteristics.

[0025] Most computational HTS methods using QSAR approaches are too idealized to handle real-world situations

[0026] Most computational QSAR approaches use idealized mathematical and statistical models. However, these idealized models are too simplistic to accommodate the complexities of real-world molecular structure and the structure-activity relationship between a drug and its target. Real-world molecular structures (and QSARs) exhibit complexities that are not idealized. Therefore there is a great need for more realistic methods of computational high throughput screening using QSAR approaches.

[0027] Methods of recursive partitioning are realistic and can deal with realities of computational HTS

[0028] Methods of recursive partitioning (RP) can deal with realities of computational HTS, including those of computational HTS methods that use QSAR approaches. Methods of RP are able, for example, to handle realities such as interaction effects, threshold effects and nonlinearities. This realization has spawned the development of new methods of RP in high throughput screening.

[0029] Some Recent Methods of RP in Computational HTS

[0030] One such recent method uses RP techniques to separate drug candidates into subgroups (or nodes of a tree), wherein drugs in nodes are similar in terms of number of specific molecular fragments and potency.^(III) A second RP method generates binary trees, wherein each node is split into two daughter nodes. In this method drugs are grouped into nodes, wherein drugs in nodes are similar in terms of biological activity and only one of the two categories of (1) presence or (2) absence of specific chemical descriptors.^(IV)

[0031] Even new RP methods of HTS (including those that use QSAR approaches) are often essentially limited to binary splitting or small data sets.

[0032] A third RP method uses chemical or molecular descriptors that are generated from 2D topological representations of molecular structures. Such descriptors include atom pairs separated by minimal topological distance, topological torsions and atom triples employing shortest path lengths between atoms in a triple. This third method, while using distance and topological type descriptors, also generates only binary trees. Thus the method is also essentially limited to a presence or absence type of categorization (or splitting). This reference indicates that segmenting into more than two daughter nodes using techniques such as FIRM is essentially limited to working with small amounts of data, because of increases in computer run time.^(V) This reference essentially indicates that viable general RP packages for HTS are limited to small data sets. See also related U.S. Pat. No. 6,434,542.

[0033] There is a great, unmet need for faster computational HTS-QSAR, RP techniques employing multi-way splitting using geometry-based molecular descriptors.

[0034] Binary splitting is essentially a two-category, (1) presence or (2) absence type approach. Such binary splitting cannot take full advantage of the dimensional measurement information present in continuous variables or descriptors such as distance type descriptors.

[0035] By contrast, multi-way splitting (or categorization) is generally more versatile than mere binary splitting. Like an ordinary ruler, multi-way splitting divides quantities such as distances into gradated segments based on number measurement. If such multi-way splitting could be done using geometry-based molecular descriptors (such as molecular descriptors based on distances between parts of a molecule), there would be a fuller and more natural use of the actual dimensional measurement information present in geometry-based molecular descriptors. Molecules could then be sorted into segments wherein the molecules in each segment have about the same actual geometric measurements of like molecular parts.

[0036] However, this great need for multi-way segmenting using geometry-based descriptors has remained unfulfilled. This is because conventional HTS-QSAR, RP techniques with distance type descriptors are essentially only viable with binary splitting. These conventional techniques, which use conventional segmenting algorithms, are too slow to do multi-way splitting.

[0037] Fast segmenting algorithms make possible computational HTS-QSAR approaches that employ multi-way splitting RP techniques with geometry-based molecular descriptors.

[0038] The inventor's novel Fast Segmenting Algorithms make multi-way splitting using geometry-based molecular descriptors a reality by greatly increasing speed and decreasing computer run times. These Fast Segmenting Algorithms (FSAs) lead to inventions that fulfill the great unmet need.

[0039] Versions of the invention fulfill the great need for true segmenting using geometry-based descriptors in computational HTS

[0040] Versions of the present invention are computer-based methods that perform multi-way segmenting on molecules (such as drug candidates) using geometry-based molecular descriptors. These computer-based methods use, or have the potential to use, one or more fast segmenting algorithms to perform their segmenting. Versions of the invention are viable RP software packages for multi-way segmenting of large data sets of drug candidates and the candidates' geometry-based molecular descriptors. These software packages are fast enough to allow a researcher to interact meaningfully with a package program during operation. Thus versions of the invention fulfill the great need for a computational RP segmenting method in pharmaceutical HTS that makes full and natural use of the dimensional measurement information present in geometry-based molecular descriptors.

[0041] Versions of the invention sort candidate molecules into subgroups. The molecules in each subgroup have molecular parts with about the same geometric measurements. Pharmacophores sought by HTS methods are important examples of such molecular parts.

[0042] Fast Segmenting Algorithms (FSAs) using geometry-based descriptors sort a group of candidate drug molecules into segments (or subgroups). The molecules in each segment (or subgroup) have molecular parts with about the same geometric measurements. When segmenting using geometry-based descriptors is done repeatedly (or recursively), group molecules are sorted into segments (or subgroups) on the basis of multiple geometric measurements. Such recursive segmenting or partitioning of a group of molecules generates a nodal tree (similar to the tree in FIG. 2). Group molecules are sorted into nodes (or subgroups) so that the molecules in each node have similar molecular parts, and these parts have about the same actual geometric measurements. In effect, the nodal tree sorts the molecules, so that molecules in some nodes have a molecular part or parts that are pharmacophores with about the same geometric measurements. This fuller, more natural use of geometric information makes for more powerful methods of finding molecules that are sought by computational HTS-QSAR procedures. In effect, HTS-QSAR approaches that employ RP techniques and multi-way splitting with geometry-based descriptors can find (and predict) more exact and better fitting “molecular puzzle pieces” and molecules. These candidate drug molecules with molecular parts or pharmacophores are the “better fitting molecular puzzle pieces” that are the ultimate pursuit of computational HTS-QSAR procedures.
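As an illustration only, the following minimal Python sketch shows how repeated multi-way segmenting can grow a nodal tree. It is not FIRM and not the inventor's software. It assumes a least-squares score, a fixed number of daughter nodes k per split, a simple depth limit as the stopping rule, and no statistical significance testing; ties in descriptor values are not given special handling. The names `best_k_split` and `grow_tree`, and the dictionary layout of a node, are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]  # one molecule: property "P" plus its descriptor values


def best_k_split(y: List[float], k: int) -> Tuple[float, List[int]]:
    """Least-squares DP split of an ordered list into k contiguous segments.

    Returns (score, boundaries) with boundaries 0 = b0 < b1 < ... < bk = len(y).
    """
    n = len(y)
    s, q = [0.0], [0.0]
    for v in y:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def cost(i: int, j: int) -> float:
        return (q[j] - q[i]) - (s[j] - s[i]) ** 2 / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j], back[m][j] = c, i
    bounds, j = [n], n
    for m in range(k, 0, -1):
        j = back[m][j]
        bounds.append(j)
    return best[k][n], bounds[::-1]


def grow_tree(rows: List[Row], descriptors: List[str], k: int = 3,
              depth: int = 2, min_size: int = 6) -> dict:
    """Recursively split a group of molecules into k daughter nodes per level."""
    node = {"n": len(rows), "mean_P": sum(r["P"] for r in rows) / len(rows), "children": []}
    if depth == 0 or len(rows) < max(min_size, k):
        return node  # leaf node
    best = None
    for d in descriptors:
        # Order the molecules by this descriptor, then segment the property values.
        ordered = sorted(rows, key=lambda r: r[d])
        score, bounds = best_k_split([r["P"] for r in ordered], k)
        if best is None or score < best[0]:
            best = (score, d, ordered, bounds)
    _, d, ordered, bounds = best
    node["split_on"] = d
    for lo, hi in zip(bounds, bounds[1:]):
        node["children"].append(grow_tree(ordered[lo:hi], descriptors, k, depth - 1, min_size))
    return node
```

Each daughter node in this sketch holds the molecules whose chosen descriptor falls in one contiguous range, so a path from the root to a node corresponds to a combination of ranges on several descriptors.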

[0043] Some details of the operation of versions of Fast Segmenting Algorithms

[0044] Conventional segmenting algorithms essentially compute an overall measure of segment homogeneity (sometimes referred to as a score) for all possible segmentations or splits of a data set. Versions of Fast Segmenting Algorithms (FSAs) achieve their increased speed by computing an overall measure of segment homogeneity (or a score value) for only some of the possible splits of a data set. In addition, some versions of FSAs compute a score value for only some select splits. These selected splits have a high probability of being a (or the) split with an optimal score value. FSAs also make use of techniques of dynamic programming such as running sums and updating. Thus versions of FSAs are fast DP algorithms that find one or more splits of a data set, wherein the splits are probable optimal splits.
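As an illustration only, the following minimal Python sketch shows the idea of scoring only some of the possible splits while using running sums. It is not the inventor's algorithm: the rule by which an FSA selects its splits is described in the DPSA Appendix and is not reproduced here. In this sketch the admissible cutpoints are simply supplied by the caller through the hypothetical parameter `candidates`, a least-squares score is assumed, and the function name `restricted_segmentation` is also hypothetical.

```python
from typing import List, Sequence, Tuple


def restricted_segmentation(y: Sequence[float], k: int,
                            candidates: Sequence[int]) -> Tuple[float, List[int]]:
    """Best split of y into k contiguous segments using only cutpoints in `candidates`."""
    n = len(y)
    # Running sums: each new value updates the totals, so any segment score
    # can later be obtained from two subtractions instead of a fresh pass.
    run_sum, run_sq = [0.0], [0.0]
    for v in y:
        run_sum.append(run_sum[-1] + v)
        run_sq.append(run_sq[-1] + v * v)

    def cost(i: int, j: int) -> float:
        s = run_sum[j] - run_sum[i]
        return (run_sq[j] - run_sq[i]) - s * s / (j - i)

    # Segment boundaries are limited to 0, the selected cutpoints, and n.
    pts = [0] + sorted({c for c in candidates if 0 < c < n}) + [n]
    p = len(pts)
    if not 1 <= k <= p - 1:
        raise ValueError("k segments need at least k - 1 candidate cutpoints")

    inf = float("inf")
    best = [[inf] * p for _ in range(k + 1)]
    back = [[0] * p for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for b in range(m, p):
            for a in range(m - 1, b):
                c = best[m - 1][a] + cost(pts[a], pts[b])
                if c < best[m][b]:
                    best[m][b], back[m][b] = c, a

    starts = []
    b = p - 1
    for m in range(k, 0, -1):
        b = back[m][b]
        starts.append(pts[b])
    return best[k][p - 1], sorted(starts[:-1])
```

With p candidate boundaries instead of all n positions, the number of scored splits in this sketch drops from roughly k times n squared to roughly k times p squared. The result is only as good as the candidate set: if the truly optimal cutpoints are not among the candidates, the returned split is an approximation.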

[0045] There is a multitude of potential applied uses for Fast Segmenting Algorithms and Special Score Functions.

[0046] Just as there is a great need for fast segmenting techniques and FSAs in pharmaceutical high throughput screening, these techniques and algorithms have great potential in general chemistry or general computational chemistry. In addition, potential uses of fast segmenting techniques and algorithms are present in a multitude of fields. A few other examples of fields in which real-world data in segmented form has great utility include clinical trials analysis (relating physiological and environmental factors to clinical outcomes), genetics (relating genetic descriptions of organisms to other organism characteristics), geology (finding minerals and oil), modeling nosocomial infections in hospitals, market research (market segmentation), industrial quality improvement (wherein data are frequently “messy” or nonidealized), and demographic studies. (No reference, technique or invention is admitted to being prior art with respect to the present invention by its mention in this background or summary.) Professor Hawkins has also invented novel measures of segment (or intra-segment) data homogeneity, special score functions (see below).

SUMMARY

[0047] The inventor has invented new Fast Segmenting Algorithms. These Fast Segmenting Algorithms are fast computer methods that “split” or “segment” data into segments (or subgroups) so that the data (or data values) within each segment are similar (or homogeneous). Conventional, slower DP segmenting algorithms compute scores for all possible splits. Versions of these new FSAs are fast because they mainly compute a measure of homogeneity (or score) for only select splits using dynamic programming (DP) techniques that speed up the calculations. These select splits have a high chance of being the best, or about the best, splits (the most homogeneous splits). One or more of these algorithms used alone, in combination or repeatedly in a recursive partitioning (RP) procedure are versions of a new invention with a multitude of potential applications.

[0048] For example, in the field of pharmaceutical high throughput screening (HTS), FSAs fulfill a great unmet need. These FSAs lead to new ways of sorting molecules that are possible new (candidate) drugs into subgroups of molecules that have the greatest potential to be new drugs. Just as an ordinary ruler can categorize objects by length, these new sorting methods use multi-way splitting with geometric molecular characteristics to categorize molecules into subgroups. This fuller, more natural use of geometric information makes for fast, practical computer methods that can find (and predict) molecules with molecular parts (or pharmacophores) that have a good geometric molecular fit, just as keys fit into a lock. In the search for new drugs, these candidate drug molecules and their pharmacophores are the “better fitting molecular puzzle pieces” that are the ultimate pursuit of the pharmaceutical industry's massive high throughput computer screening projects.

[0049] By contrast, conventional computer sorting techniques used for high throughput pharmaceutical screening do not make such full use of geometric information. Even conventional techniques that use distance type characteristics of drug candidates are too slow to segment molecules into multiple categories. Instead, these slow conventional techniques use only a (binary) two-category, yes-no type of classification scheme.^(VI)

[0050] Conventional Techniques. Techniques that are analogous to the slow, conventional pharmaceutical screening techniques have only a (binary) two-category, yes-no type of classification scheme. These conventional techniques can only sort the candidates into two groups, such as (A1) those 6 feet tall and (A2) those not 6 feet tall, or (B1) those who jump 1 foot and (B2) those who jump a height other (higher or lower) than 1 foot, or (C1) those whose arms are 2 feet long and (C2) those whose arms are not 2 feet long, etc. These conventional techniques are too slow to do real segmenting.

[0051] Fast Segmenting Techniques. By contrast, techniques that are analogous to FSAs for pharmaceutical screening can sort the candidates into segments: (A1) those 5 to 5.5 feet tall, (A2) those 5.5 to 6 feet tall, (A3) those 6 to 6.5 feet tall and (A4) those over 6.5 feet tall; (B1) those who jump 0.5 to 1 foot, (B2) those who jump 1 to 1.5 feet, (B3) those who jump 1.5 to 2 feet and (B4) those who jump over 2 feet; (C1) those whose arms are 1.5 to 2 feet long, (C2) those whose arms are 2 to 2.5 feet long and (C3) those whose arms are over 2.5 feet long; (D1) those whose run time down the court is less than 3.2 seconds, (D2) those whose run time down the court is 3.2 to 3.5 seconds, (D3) those whose run time down the court is 3.5 to 4 seconds and (D4) those whose run time down the court is over 4 seconds. Suppose an ideal candidate to play guard (a certain position on the team) is generally (A2) 5.5 to 6 feet tall, (B2) jumps 1 to 1.5 feet, (C2) has arms of length 2 to 2.5 feet and (D1) runs down the court in less than 3.2 seconds. FSA techniques can generate a nodal tree such as FIG. 2 with “an ideal guard node” (or subgroup) that only contains candidates who are in all four segments (A2), (B2), (C2) and (D1). These people are good candidates to be a guard. Suppose there are 100 such good candidates in the node. FSA techniques are fast enough that they allow human interaction, so a researcher could further split the ideal guard node into three more nodes based on weight: (E1), (E2), and (E3). Suppose that generally an ideal guard's weight is in segment (E1), weight less than 175 lbs. Ideal candidates are now in the node that corresponds to the five segments (A2), (B2), (C2), (D1) and (E1). Each of the candidates in this node has the measurements that make a “good fit” for the job of basketball guard.
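As an illustration only, the following minimal Python sketch shows the multi-way sorting of the basketball analogy once the cutpoints are known. It does not show how an FSA finds the cutpoints. The cutpoints are those of the segment definitions (A1) to (D4) above; the treatment of values that lie exactly on a cutpoint (assigned to the higher segment) and of values below the lowest stated range (assigned to the first segment) are assumptions, and the candidate records and all names in the code are invented for the example.

```python
from bisect import bisect_right
from collections import defaultdict

# Descriptor name -> (segment letter, cutpoints between consecutive segments).
DESCRIPTORS = {
    "height_ft": ("A", [5.5, 6.0, 6.5]),
    "jump_ft": ("B", [1.0, 1.5, 2.0]),
    "arm_ft": ("C", [2.0, 2.5]),
    "run_s": ("D", [3.2, 3.5, 4.0]),
}


def segment_label(letter, cuts, value):
    """Map a measurement to its segment label, e.g. 5.8 ft -> 'A2'."""
    return f"{letter}{bisect_right(cuts, value) + 1}"


def node_key(candidate):
    """Tuple of segment labels, one per descriptor, identifying a tree node."""
    return tuple(
        segment_label(letter, cuts, candidate[name])
        for name, (letter, cuts) in DESCRIPTORS.items()
    )


def sort_into_nodes(candidates):
    """Group candidate names by the combination of segments they fall in."""
    nodes = defaultdict(list)
    for c in candidates:
        nodes[node_key(c)].append(c["name"])
    return dict(nodes)


# Invented candidates, for illustration only.
candidates = [
    {"name": "p1", "height_ft": 5.8, "jump_ft": 1.2, "arm_ft": 2.2, "run_s": 3.1},
    {"name": "p2", "height_ft": 6.2, "jump_ft": 1.2, "arm_ft": 2.2, "run_s": 3.1},
    {"name": "p3", "height_ft": 5.6, "jump_ft": 1.4, "arm_ft": 2.4, "run_s": 3.0},
]
nodes = sort_into_nodes(candidates)
ideal_guards = nodes.get(("A2", "B2", "C2", "D1"), [])
```

A binary scheme would need a separate yes-no question for every range, whereas here one lookup per descriptor places a candidate in one of several graded segments.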

[0052] Suppose further that this node contains 50 candidates. The 50 still have to be tested physically by playing basketball. A physical test is especially important in this case, because human beings are not molecules. There may even be very good guards who are not in the node. Computers are powerful tools, but they are not all-knowing.

[0053] Fast Segmenting Algorithms have practical uses in many fields. Many types of data that are displayed in segmented form are easier to work with and easier to understand. A few examples of such types of data are fast enough to allow a human user to interact meaningfully with an RP software package that uses FSAs to segment data. Furthermore, these FSAs give rise to inventions that are not just computer programs. These inventions include (but are not limited to) special purpose computers programmed for specific tasks and data structures (computer data in an arranged format).

[0054] Special score functions invented by Professor Hawkins also have numerous and similar applications. This background and summary are not necessarily exhaustive.

[0055] Patents

[0056] Some patent publications which may be useful in understanding versions of the invention are U.S. Pat. Nos. 4,719,571; 5,787,274; 6,182,058; 6,434,542; and publication T998008. U.S. Pat. No. 6,434,542 also deals with recursive partitioning of molecules and individuals. None of these patents or publications is admitted to being prior art by their mention in this background.

BRIEF DESCRIPTION OF DRAWINGS

[0057] FIG. 1 is an illustration of atom class pairs and geometry-based molecular descriptors.

[0058] FIG. 2 is an illustration of a Nodal Tree generated by a version of the invention, described in Example 1. The data objects are molecules and the descriptors are geometry-based molecular descriptors. FIG. 2 is similar to a screenshot. FIG. 2 is typical of the appearance of Nodal Trees displayed by versions of the invention on a monitor.

[0059] FIG. 3 Data representation for a Group (or Node) of N Data Objects, Matrix Data Representation: There are N data objects in the group (or node) of Data Objects. The data objects are denoted O₁, O₂, . . . , O_(N). The property is denoted as P. And the M descriptors are denoted D₁, D₂, . . . , D_(M). Each row of the matrix corresponds to a data object. The first matrix column corresponds to the property P. And each of the other matrix columns corresponds to a descriptor. The value of the property and each descriptor for each data object is recorded in the corresponding matrix cell.
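As an illustration only, the matrix representation of FIG. 3 could be held in memory along the following lines. This is a minimal Python sketch, not the data structure of any version of the invention; the class name `NodeData`, its field names and its methods are hypothetical, and the numeric values in the usage example are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeData:
    """Matrix for one group (node): row i holds [P, D1, ..., DM] for one data object."""
    object_ids: List[str]
    descriptor_names: List[str]
    matrix: List[List[float]]

    def __post_init__(self):
        width = 1 + len(self.descriptor_names)
        if len(self.matrix) != len(self.object_ids):
            raise ValueError("one matrix row is required per data object")
        if any(len(row) != width for row in self.matrix):
            raise ValueError("each row must hold P followed by M descriptor values")

    def property_column(self) -> List[float]:
        """Values of the property P, one per data object (first matrix column)."""
        return [row[0] for row in self.matrix]

    def descriptor_column(self, name: str) -> List[float]:
        """Values of one descriptor, one per data object."""
        j = 1 + self.descriptor_names.index(name)
        return [row[j] for row in self.matrix]

    def subset(self, rows: List[int]) -> "NodeData":
        """Data for a daughter node made of the selected rows."""
        return NodeData(
            [self.object_ids[i] for i in rows],
            self.descriptor_names,
            [self.matrix[i] for i in rows],
        )


# Invented values: three data objects, one property, two descriptors.
node = NodeData(
    object_ids=["O1", "O2", "O3"],
    descriptor_names=["D1", "D2"],
    matrix=[[0.9, 3.1, 7.0], [0.2, 4.8, 6.5], [0.8, 3.0, 7.2]],
)
daughter = node.subset([0, 2])
```

Keeping the property in the first column and the descriptors in the remaining columns mirrors the figure, and the `subset` method corresponds to passing a group of data objects down to a daughter node.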

[0060] FIG. 4 is an illustration of a Segmenting Nodal Tree Generating (or Growing) Process similar to GenSNTGP#1. The Figure illustrates that versions of the Invention output data to a display device, to a storage device, or send data (such as over the internet) or some combination of two or more of these.

[0061] FIG. 5 Data Segmentation Example: The histogram of y is depicted at left. It is not obvious without plotting y versus x that there is any pattern to the data. We see at right that the mean of y is constant within certain ranges of x. The optimal segmentation of this data would be to divide x with cutpoints at 0.3, 0.45, 0.55, and 0.7.

DESCRIPTION

[0062] The use of computer-based segmenting algorithms to segment a group of sequential data into like parts (similar subgroups) is a known technique.¹

[0063] Computer-based segmenting techniques have continued to evolve and become more sophisticated.² Because it is a good teaching that is relevant to the application and is open to the public, reference 2 (Musser B J, Extensions to Recursive Partitioning. Doctoral thesis (October 1999) under the supervision of Professor Douglas M. Hawkins, School of Statistics, University of Minnesota, St. Paul, Minn. 55108 USA) is incorporated herein by reference to the fullest extent of the law. This thesis is in the public domain. A copy of the thesis in PDF format on floppy diskette is included with the U.S. provisional patent application No. 60/225113 filed Aug. 14, 2000.

[0064] Some important teachings of reference 2. Reference 2 teaches the use of segmenting algorithms combined with methods of recursive partitioning.

[0065] In reference 2, some types of data are designated predictors (for example X on page 10). There are different types of predictors: for example, monotonic, free, float (pp. 4 and 6). Monotonic predictors are essentially quantitative in nature. Free predictors are essentially nominal in nature. And float predictors are commonly used to represent or accommodate “missing data”. Other types of data are designated to be responses (for example Y on page 10). Pairs of predictor and response values are made (for example (X_(i), Y_(i)) on page 10). Such pairings tend to conceptualize a response to be predicted by, correlated with or caused by a predictor.

[0066] As in other segmenting algorithms and methods, those taught in reference 2 (and similar methods) frequently use (1) one or more measures of the homogeneity of data (or data values) within segments, or (2) an overall measure of the homogeneity of data (or data values) within segments for all (or most of) the segments or (3) a combination of (1) and (2) to “segment” data. It is also possible for these methods (and similar methods) to use measures of inter-segment data value inhomogeneity to segment data. Some important measures of homogeneity or inhomogeneity used by these and similar methods are least square type measures, deviance measures and statistical measures. Some of the computer-based segmenting methods taught in reference 2 and similar methods generate data structures such as dendrograms, nodal trees and equivalent structures. Such data structures elucidate correlation, prediction or causal type relationships between one or more predictors and a response in many cases. Several examples of nodal trees (or similar data structures) are given in reference 2.

[0067] An example of a method of generating one or more nodes of such a nodal tree or trees is given by a flow chart in reference 2 (FIG. 1.1, page 5). Central to the teachings of reference 2 is the technique of FIRM (Formal Inference-based Recursive Modeling, chapter 1) and similar techniques of Random FIRM (chapter 6) and NextFIRM (chapter 7 and chapter A, including computer code).

[0068] Some advantages of Random FIRM are discussed on pages 113 and 114. Random FIRM has the capability of generating a tree or trees by segmenting using a predictor that does not give the best possible segmentation of the data or smallest p-value. Such capability is closely related to segmenting algorithms and methods that are less deterministic and give one or more approximately best segmentations of data. Also important in the teaching of reference 2 are techniques of dynamic programming.

[0069] Additional references that are related to those already given and that shed light on aspects of segmenting algorithms and other concepts cited in this application are given in the endnotes. These references are incorporated herein to the fullest extent of the law.^(3,4,5,6,7) (No reference of reference 1 through 7 inclusive is admitted to being prior art with respect to the present invention by its discussion or mention in this description.)

[0070] The DPSA Appendix contains a more detailed description of DP Segmenting Algorithms^(VII) such as the Hawkins DP algorithm and FSAs. As described above and in references 1-9 inclusive (endnotes) and the DPSA Appendix, a segmenting algorithm essentially segments data, so that the Y values (or response values) that correspond to each segment are essentially homogeneous. Each segment is an interval or a grouping of X values (or predictor values). And each Y value (or response value) is associated with an X value (or predictor value). Thus, in a simple usage, a segmenting algorithm segments using one response (variable) Y, and one predictor (variable) X.

[0071] Implementation of such (1) segmenting algorithms, (2) recursive partitioning techniques or (3) a combination of (1) and (2) without undue experimentation is within the capability and understanding of those within the combined arts of computer science and statistics and neighboring arts (including computational methods of high throughput molecular screening or drug screening) after reading this description (including the DPSA Appendix, which describes versions of fast segmenting algorithms) and the references cited above.

[0072] For this application, we define a computer-based method or process that uses (1) one or more segmenting algorithms or (2) one or more recursive partitioning methods or procedures or (3) a combination of (1) and (2) as a Segmentation/Recursive Partitioning Process (or abbreviated as an S/RP P); wherein the meaning of segmenting algorithm is any meaning or similar meaning in any one of references 1 through 9 inclusive and the DPSA Appendix. The term “recursive partitioning” is well known in the art of computer science and arts cited above. A discussion of the term is found in the Introduction and early pages of reference 2 and further definition and meaning of the term and its combination with segmentation techniques is found in reference 2.

[0073] For this application, we define a computer-based method or process that uses one or more segmenting algorithms as a segmenting process (or segmentation process), abbreviated as an SP. An SP is an S/RP Process.

[0074] Other Examples of S/RP Processes

[0075] There are other examples of S/RP Processes and similar processes. Other versions of FIRM include CATFIRM and CONFIRM. There are other computer-based methods that are similar to FIRM such as AID, CHAID, DP.CHAID, CART, KnowledgeSEEKER, TREEDISC, and similar techniques.

[0076] Helpful in understanding and implementing FIRM is the FIRM manual, Formal Inference-based Recursive Modeling. The latest version is release 2.2, 1999. This manual can be downloaded over the internet. This manual and software can also be ordered from the University of Minnesota bookstore for a nominal charge. The manual and software are incorporated herein by reference to the fullest extent of the law.⁸ In addition, concepts useful for understanding and implementing versions of FIRM are described in Chapter 5: Automatic Interaction Detection by Hawkins and Kass pp. 269-302 in the book Topics in Applied Multivariate Analysis; Hawkins, D. H., Ed. Cambridge University Press.⁹ This chapter is incorporated herein by reference to the fullest extent of the law. (The manual, software, and book chapter are not admitted to being prior art by their mention in this description.)

[0077] One or more of the S/RP Processes, or similar methods listed above, or one or more similar processes not specifically listed use one or more dynamic programming (DP) segmenting algorithms such as the Hawkins DP segmenting algorithm, an FSA or similar algorithm.

[0078] As noted in the Background, Hawkins has developed a DP Segmenting Algorithm (DPSA) for segmenting sequential data (see Reference 1, pp. 390-391 and later in this Description section for more details). One or more S/RP Processes, or similar methods use (or have the potential to use) this Hawkins DP (segmenting) algorithm, or one or more similar DPSAs (such as one or more FSAs) to segment data. As described above, some versions of FIRM accommodate or manipulate data that is “floating” or “free”, such as float or free predictors. In some cases, such as some versions of FIRM, an S/RP Process (or similar method) manipulates data into an essentially sequential format wherein such essentially sequential data is segmented by a Hawkins DP algorithm or similar algorithm(s).

[0079] Also as noted in the Background, the inventor has invented Fast Segmenting Algorithms (FSAs) that are much faster than the Hawkins DP algorithm (and similar algorithms) especially when segmenting large amounts of data. One or more S/RP Processes, or similar methods use (or have the potential to use) one or more FSAs to segment real-world data.

[0080] Any computer based method that uses an FSA (or wherein the method has an FSA that is available for use) is a version of the invention.

[0081] Any computer based method (for example an S/RP Process or a similar method) that uses one or more FSAs on any kind of data, including real-world data is a version of the invention. Any computer based method (for example an S/RP Process or a similar method) wherein one or more FSAs is available for use by the method on any kind of data, including real-world data is a version of the invention.

[0082] Special Score Functions

[0083] The Binomial Score Function. Professor Hawkins has also invented a novel measure of segment (or intra-segment) data homogeneity. This new score function is particularly well suited for a univariate response, wherein the response has only two (or essentially only two) values. An example of the kind of data for which this new score function is particularly well suited is data wherein the (univariate, two-valued) response value(s) that are essentially associated with each segment (of one or more segments) are essentially distributed according to a binomial distribution. More details on the Binomial Score Function (abbreviated BScore or BScore Function) and homogeneity measures derived therefrom are given in the DPSA Appendix.

[0084] Measures of Segment Data Homogeneity and of Overall Segment Data Homogeneity (for a split) that are derived from a Binomial Score Function

[0085] One or more segment homogeneity measures (or score functions) are derived from a BScore function as described in the DPSA Appendix. One or more overall measures of segment homogeneity (for a split) are derived from a BScore function as described in the DPSA Appendix. A measure of segment data homogeneity or a measure of overall segment data homogeneity (for a split) that is derived from a Binomial Score Function is a Binomial-derived Score Function, abbreviated BdScore or BdScore Function.
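
The exact BScore and BdScore definitions are given in the DPSA Appendix and are not reproduced in this section. Purely as a hedged illustration of what a homogeneity measure for a two-valued response can look like, the sketch below uses the ordinary maximized binomial log-likelihood of a segment as a stand-in. The function names, the 0/1 coding of the response and the choice of log-likelihood are all assumptions of the example and are not taken from the Appendix.

```python
import math


def binomial_log_likelihood(successes, total):
    """Maximized binomial log-likelihood of one segment (stand-in score).

    The value is 0.0 for a perfectly homogeneous segment (all responses the
    same) and becomes more negative as the two response values mix.
    """
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    score = 0.0
    for count in (successes, total - successes):
        if count > 0:  # a zero count contributes nothing to the sum
            score += count * math.log(count / total)
    return score


def split_score(segments):
    """Overall stand-in score for a split: the sum of the segment scores.

    Each segment is a list of responses coded 0 or 1.
    """
    return sum(binomial_log_likelihood(sum(seg), len(seg)) for seg in segments)
```

Under this stand-in, a split whose segments each contain only one response value scores 0.0, and any split with mixed segments scores lower.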

[0086] Bd-DPSAs

[0087] As described in the DPSA Appendix, one or more DPSAs have the potential to use a BScore Function or a BdScore Function as a measure of homogeneity. A DPSA that uses a BScore Function or a BdScore Function is a Bd-DPSA.

[0088] Utility of the Pillai Trace Statistic for Segmenting Multi-variate (vector) Response Data

[0089] Professor Hawkins has also discovered a special utility for the Pillai-Trace Statistic in segmenting multi-variate response data. Such data is equivalent to response data in vector form. As described in the DPSA Appendix, one or more DPSAs have the potential to use a score function that is derived from the Pillai-Trace Statistic. A PTd-DPSA is a DPSA that uses a score function that is derived from the Pillai-Trace Statistic.
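
For orientation only, the standard Pillai trace for grouped vector responses is V = trace(H (H + E)^-1), where H is the between-segment and E the within-segment sum-of-squares-and-cross-products matrix. The sketch below computes V for the special case of a two-component (bivariate) response so that the 2 x 2 inverse can be written out by hand. How a score function is derived from this statistic is described in the DPSA Appendix and is not shown here; the function name and the restriction to two response components are assumptions of the example.

```python
def pillai_trace_2d(segments):
    """Pillai trace V = trace(H (H + E)^-1) for bivariate responses.

    segments: list of segments, each a list of (y1, y2) response pairs.
    H + E equals the total sum-of-squares-and-cross-products matrix T,
    so only H and T need to be accumulated.
    """
    points = [p for seg in segments for p in seg]
    n = len(points)
    g1 = sum(p[0] for p in points) / n
    g2 = sum(p[1] for p in points) / n

    # Between-segment matrix H (symmetric 2 x 2: h11, h12, h22).
    h11 = h12 = h22 = 0.0
    for seg in segments:
        m = len(seg)
        d1 = sum(p[0] for p in seg) / m - g1
        d2 = sum(p[1] for p in seg) / m - g2
        h11 += m * d1 * d1
        h12 += m * d1 * d2
        h22 += m * d2 * d2

    # Total matrix T = H + E.
    t11 = sum((p[0] - g1) ** 2 for p in points)
    t12 = sum((p[0] - g1) * (p[1] - g2) for p in points)
    t22 = sum((p[1] - g2) ** 2 for p in points)

    det = t11 * t22 - t12 * t12
    if det == 0:
        raise ValueError("total SSCP matrix is singular")
    # trace(H T^-1), with T^-1 = [[t22, -t12], [-t12, t11]] / det.
    return (h11 * t22 - 2.0 * h12 * t12 + h22 * t11) / det
```

With two segments the statistic lies between 0 (identical segment mean vectors) and 1 (segments fully separated along some direction).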

[0090] Special DPSAs. An FSA, a Bd-DPSA or a PTd-DPSA is a special DP Segmenting Algorithm or a special DPSA.

[0091] Any computer based method that uses (or has available for use) a special DPSA (FSA, Bd-DPSA, PTd-DPSA) is a version of the invention.

[0092] Any computer based method (for example an S/RP Process or a similar method) that uses one or more special DPSAs on any kind of data, including real-world data is a version of the invention. Any computer based method (for example an S/RP Process or a similar method) wherein one or more special DPSAs is available for use by the method on any kind of data, including real-world data is a version of the invention.

[0093] Data Objects, Descriptors, Predictors and Responses

[0094] From a computational standpoint, a data object is a representation of an object. It is possible for the object that is represented by the data object to be an abstract object or a real-world object. Objects (both abstract and real) have characteristics. From a computational standpoint, a descriptor is a characteristic of an object that is represented by a data object.

[0095] Just as a real world object frequently has more than one characteristic, a data object frequently has more than one descriptor. As described in more detail below, each descriptor of a data object has a particular “descriptor value” for that data object.^(VIII) Also as described in more detail below, a descriptor is frequently essentially quantitative or qualitative and has quantitative or qualitative values respectively.

[0096] Generally a real-world data object represents a real-world object such as for (nonlimiting) examples, an actual physical object, physical phenomenon, a real-world phenomenon or phenomena, a physical datum or data.

[0097] It is possible to conceptualize a first descriptor as being predicted by, correlated with or caused by a second descriptor for a group of data objects. Under such a conceptualization, for example, the first descriptor is designated to be a response and the second descriptor is designated to be a predictor (of the response).^(IX) The designation of a first descriptor as a response and a second descriptor as a predictor is somewhat arbitrary. This designation is arbitrary in the sense that it is possible to designate the first descriptor as a predictor and the second descriptor as a response (of the predictor).

[0098] The terms predictor and descriptor are used essentially interchangeably in this patent application. And the reader should keep in mind the frequently or somewhat arbitrary nature of the distinction between a response and a predictor (or descriptor).

[0099] Versions of the Invention Segment Data Objects, Wherein the Data Objects Represent Abstract or Real-world Objects

[0100] Versions of the invention are methods for segmenting data objects using descriptor values, wherein a descriptor is designated a response and one or more descriptors are designated as predictors. The data objects segmented by versions of the invention represent abstract or real-world objects.

[0101] Illustrative Example of the versions of the invention that handle real-world data.

[0102] This description will now begin with an illustrative description of versions of the invention that segment a particular kind of real-world data. This real-world data is in the field of high throughput screening of candidate pharmaceuticals. Versions of the invention segment molecules, such as molecules that are drug candidates, using one or more geometry-based molecular descriptors. (For versions of the invention described herein, geometry-based descriptors are essentially equivalent to predictors.) As described in the background, multi-way segmenting of molecular drug candidates using geometry based molecular descriptors has been considered essentially impractical. Thus such multi-way segmenting of molecules (for example drug candidate molecules) using geometry based molecular descriptors is essentially novel and unobvious.

[0103] Molecules as Data Objects

[0104] Molecules are real world objects. For versions of the invention, a data object is a representation of a real world object. In this application, the term molecular data object means a data object that represents a molecule. Molecules have characteristics. These characteristics are both quantitative and qualitative in nature. Examples of quantitative molecular characteristics are various distances between parts of a molecule or molecules, such as a distance between two atoms in a molecule. An example of a qualitative molecular characteristic is the gross color of a large quantity of the molecule in pure, solid form (such as a powder).

[0105] Versions of the invention use segmenting algorithms and recursive partitioning techniques (similar to those described in references 1 through 7 inclusive above) and designate one or more of the characteristics of a group of molecules essentially as predictors; and designate a molecular characteristic of the group of molecules essentially as a response. In this application the molecular characteristic that is designated as the response is referred to as a molecular property. (The terms predictor and response are used essentially as in reference 2.) By doing this, versions of the invention are essentially a method of predicting one or more molecular properties on the basis of one or more molecular characteristics. Alternatively, versions of the invention elucidate correlation or causal type relationships between one or more molecular characteristics and one or more molecular properties. Versions of the invention do this by characterizing molecules as data objects, and molecular properties and characteristics as descriptors or molecular properties.

DESCRIPTORS

[0106] From a computational standpoint, it is possible to consider a particular molecule as a data object. And it is possible to consider one or more characteristics of a molecule as descriptors of a data object that represents the molecule.

[0107] Definition of a descriptor (for versions of the invention): A qualitative or a quantitative characteristic of a data object. A qualitative characteristic is a qualitative descriptor, a quantitative characteristic is a quantitative descriptor. An example of a quantitative descriptor is a person's age, the person being represented by a data object. An example of a qualitative descriptor is the odor of a mushroom, the mushroom being represented by a data object. An example of a quantitative molecular descriptor is a distance between two molecular parts. An example of a qualitative molecular descriptor is the color of a large quantity of the molecule in pure powder form.^(X)

[0108] Definition of the descriptor value for a data object (for versions of the invention): Each descriptor has a particular value for a particular data object. The value being (1) a quantitative value for a quantitative descriptor or a qualitative value for a qualitative descriptor or (2) the value being “missing” when a quantitative or qualitative value has not been determined. (Note that a quantitative descriptor value is similar or essentially equivalent to a “monotonic predictor value” of reference 2. A qualitative descriptor value is similar or essentially equivalent to a “free predictor value” of reference 2. And the concept of a “float predictor value” of reference 2 is similar or equivalent to a descriptor value that is “missing”.)

[0109] An example of a value of a quantitative descriptor for an object is the age “61 years” for a particular person. An example of a value of a qualitative descriptor for an object is the odor “fishy” for a particular mushroom. An example of a value of a qualitative molecular descriptor is the color “white” for a particular molecular substance in pure powder form. An example of a value of a quantitative geometry-based molecular descriptor is the number of angstroms between two atoms of a particular molecule when the molecule is in a particular conformational state.

[0110] The use of geometry-based molecular descriptors and S/RP P techniques in versions of the invention is novel and unobvious. Any computer-based method of segmenting two or more molecular data objects using one or more geometry-based molecular descriptors by utilizing one or more S/RP Processes is a version of the invention. These geometry-based molecular descriptors (both qualitative and quantitative) are described in more detail below.

[0111] Illustrative Examples of Versions of the Invention. Other examples of quantitative molecular descriptors (predictors) are atom class pairs and various types of “through compound path lengths” between the focal atoms of atom class pairs. This type of descriptor (or predictor) is an example of a geometry-based molecular quantitative descriptor. Example 1 is an illustrative example of a version of the invention that makes use of such descriptors (or predictors). For more details on the particular descriptors used (path length low and high, PLLO and PLHI between atom class pairs) see Example 1.

[0112] Some further details on geometry-based molecular descriptors (or predictors)

[0113] In example 1, the molecular features used are atom class pairs, and the focal atoms of each atom class in the pair are the distance measurement endpoints. Each geometry-based descriptor (or predictor) depends on one or more molecular features.

[0114] Molecular features include, but are not limited to, atoms, a molecular part or parts, functional groups, surface regions, quantum mechanical representations of a molecular part or parts, field or charge representations of a molecular part or parts, elements of protein, peptide, DNA, RNA, biopolymer, or polymer sequences. For some geometry-based quantitative descriptors, the value of each descriptor is determined by using one or more distance measurement endpoints. Each distance measurement endpoint of a molecular feature is a point on or within a molecular feature. It is possible for any point that is on or within a molecular part to be used as a distance measurement endpoint. A molecular feature separation distance is a distance measurement (or value) between two or more distance measurement endpoints. Examples of molecular feature separation distances include (but are not limited to) distance measurements (or values) between two centroids, nearest distances between two molecular features (or parts), farthest distances between two molecular features (or parts), and the shortest or longest (through space or through compound) connected path length between two or more distance measurement endpoints. A geometry-based quantitative or qualitative descriptor includes (but is not limited to) any descriptor whose descriptor value is determined in whole or in part by one or more molecular feature separation distances.

[0115] Examples of molecular feature separation distances include through compound path lengths, which are integer distances in a graph representation of a molecule. Two or three dimensional spatial relationship distances also constitute examples of molecular feature separation distances. Examples of such two and three dimensional spatial relationship distances are the low or the high distance in angstroms between atom class pairs across one or more of multiple conformations of a molecule.
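
The two kinds of separation distance just described can be sketched as follows. The first function treats a molecule as a graph (atoms as vertices, bonds as edges) and returns the shortest through compound path length as an integer bond count; the second returns the low and high through-space distance between two atoms across several conformations. The function names, the atom labels and the input layouts (a list of bonded atom pairs, and one coordinate dictionary per conformation) are assumptions made for illustration and are not the data formats of Example 1.

```python
import math
from collections import deque


def through_compound_path_length(bonds, start, goal):
    """Shortest number of bonds between two atoms, or None if unconnected.

    bonds: iterable of (atom, atom) pairs forming the molecular graph.
    """
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:  # breadth-first search visits atoms in order of distance
        atom, dist = queue.popleft()
        for nxt in neighbors.get(atom, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def distance_low_high(conformations, atom_a, atom_b):
    """(low, high) Euclidean distance between two atoms across conformations.

    conformations: list of dicts mapping atom label -> (x, y, z) coordinates,
    in angstroms by assumption.
    """
    distances = [math.dist(c[atom_a], c[atom_b]) for c in conformations]
    return min(distances), max(distances)
```

In a six-membered ring labeled C1 through C6, for example, the through compound path length from C1 to C4 is 3 bonds in either direction around the ring.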

[0116] In addition to using distance between measurement endpoint pairs to determine geometry-based molecular descriptor values, a geometry-based descriptor is any descriptor whose value is determined in whole or in part by a geometry-based metric (or measure). An example of such a metric is any metric that is derived from a combination of distances between two or more measurement endpoints. Other examples of such a geometry-based metric include a measure (or value) of any area or volume circumscribed or bounded by two or more distance measurement endpoints of molecular features. Geometry-based metrics include non-Euclidean distance metrics. Geometry-based metrics also include measures of distance that are computed in the dual plane (a concept from computational geometry).

[0117] A geometry-based metric includes any mathematical function, calculation, or the equivalent thereof that uses any of the distances or metrics mentioned above singly or in combination. Such mathematical functions or calculations include, but are not limited to, statistical functions. These include such statistical functions as mean, median, mode and other higher order statistical functions or measures. Examples of qualitative geometry-based molecular descriptors (or predictors) are one or more molecular features (or one or more measurement endpoints) that are essentially concave or convex, colinear, planar or coplanar.^(XI)

[0118] Versions of the invention have a very wide range of applicability. Versions of the invention have operability and utility for molecules that are not man-made or are extracts or modifications of natural substances. Therefore, a descriptor of any molecule that is encompassed in the definitions, discussion or description of a geometry-based molecular descriptor in essentially any branch of chemistry, or related discipline is an example of a geometry-based molecular descriptor.

[0119] The distinction between a molecular property (or response) and a molecular descriptor (or predictor) is essentially arbitrary in that it is possible to designate a descriptor (or predictor) as a property (or response).

[0120] A molecular property (response) includes (but is not necessarily limited to) (1) any measurable, inferable or observable physical, chemical, or biological property of a molecule. In addition, (2) any molecular descriptor as described above (including predictor type descriptors) is a molecular property, and (3) any combination of one or more of the properties as described in (1) and (2) of this paragraph is a molecular property. Any mathematical computation that uses one or more of the properties as described in (1), (2) and (3) of this paragraph is a molecular property.

[0121] Any property of a drug, molecule or molecular substance used in any phase of the pharmaceutical industry is an example of a molecular property. Such phases include research, development, testing, manufacture or usage of a drug or other molecule or molecular substance in the pharmaceutical industry. A molecular property includes any property used in animal, cell line or human studies. Any property of an auxiliary molecule (that is not the, or a, principally active compound) such as a molecule that is part of a drug delivery system is an example of a molecular property. Any combination of one or more such molecular properties is also a molecular property.

[0122] Molecular properties include (but are not limited to) drug potency, drug toxicity, solubility, drug absorption profile, positive or negative drug effects. A molecular property is any drug effect on one or more individuals that is associated with one or more descriptions (or descriptors) of the genetic make-up of the one or more individuals. Any ADMET property is a molecular property. The distinction between a property (response) and a descriptor is somewhat arbitrary in that it is possible to designate a descriptor as a property (response).^(XII) Versions of the invention have utility and operability in many areas of chemistry outside the pharmaceutical industry as well. Therefore, for example, any property of a molecular substance used in any phase of the chemical industry is an example of a molecular property. Such phases include (but are not limited to) research, development, testing, manufacture or usage of a molecule in the chemical industry. Any combination of one or more such molecular properties is also a molecular property. Versions of the invention have a very wide range of applicability. Versions of the invention have operability and utility for molecules that are not man-made or are extracts or modifications of natural substances.

[0123] (Molecule: In this application, the term molecule is used in the term's broadest possible sense. It is also possible for the term molecule to mean a complex of one or more molecules (wherein the term “molecules” is used in the term's usual sense) that are in close proximity. As is evident, versions of the invention have utility and operability in the study of some such molecular complexes that are within the meaning of the term “molecule” as used in this application.)

[0124] Some further details on molecular descriptors, properties, and other descriptors can be had in Molecules 2002, 7, 566-600; An Introduction to QSAR Methodology by Richon and Young, (Network Science), http://www.netsci.org/Science/Compchem/feature19.html, published Oct. 1997; Chemometrics and Intelligent Laboratory Systems 60 (2002), pp. 5-11; Goodman & Gilman's The Pharmacological Basis of Therapeutics ISBN: 0071354697. Each of these four publications is incorporated herein by reference to the fullest extent of the law.

[0125] Further General Description of Versions of the Invention

[0126] General and more specific descriptions of versions of the invention are given below. Some versions of the invention described below are for (or handle) essentially any kind of data or data objects, including abstract data or data objects, or real-world data or data objects. Some versions of the invention described below are more specifically for (or handle) essentially molecular data (such as geometry-based molecular descriptors) or molecular data objects.

[0127] I. Some Versions of Simple Segmentation Processes

[0128] Some simple versions of the invention use one segmenting algorithm, one property (response) and one descriptor to segment a group of data objects:

[0129] As described above in references 1-9 inclusive and the DPSA Appendix, a segmenting algorithm essentially segments data, so that the Y values (or response values) that correspond to each segment are essentially homogeneous. Each segment is an interval or a grouping of X values (or descriptor values). And each Y value (or response value) is associated with an X value (or descriptor value). Thus, in a simple usage, a segmenting algorithm segments using one response Y, and one descriptor X.

[0130] In this application, a property (for example a molecular property) is equivalent to Y (or the response); and a descriptor (for example a molecular descriptor) is equivalent to X (or the descriptor). And each property (response) value corresponds to a descriptor value in that the property (response) value and descriptor value belong to the same data object. Thus, a segmenting algorithm (in the context of versions of the invention) essentially segments a group of data objects so that the property (response) values within each descriptor value segment are essentially homogeneous. Thus, in a simple usage (for versions of the invention), a segmenting algorithm segments using one property (response), and one descriptor.

[0131] Continuing, a segmenting algorithm (in the context of some versions of the invention) essentially segments a group of molecules (or molecular data objects) so that the molecular property (response) values within each segment are essentially homogeneous; and each segment is essentially an interval of values for a single molecular descriptor. Thus, in a simple usage (for some versions of the invention), a segmenting algorithm segments a group of molecules using one molecular property (response), and one molecular descriptor.

[0132] Usage of geometry-based molecular descriptor(s) and segmenting algorithms for segmenting

[0133] A process (or apparatus) that essentially segments a group of molecules using one or more segmenting algorithms and one or more geometry-based molecular descriptors is a version of the invention. Such inventions and related inventions have been invented by the inventor and are described in this application.

[0134] A general description of a version of such inventions is as follows.

[0135] SSP#1 A computer-based method of segmenting a group of two or more data objects into two or more subgroups using a segmenting algorithm and a descriptor and a response, comprising:

[0136] obtaining a value for the descriptor and the response for each object in the group; and

[0137] segmenting the data objects in the group into two or more subgroups using the segmenting algorithm, the response value and the descriptor value for each object in the group.

[0138] An example of the segmenting algorithm of SSP#1 is a DPSA or a special DPSA (an FSA, a Bd-DPSA or a PTd-DPSA). Versions of SSP#1 handle any kind of data or data objects including real-world data, such as molecular data and data objects, and geometry-based descriptors. Versions of SSP#1 output data in segmented form to a monitor, LCD, printer or equivalent device for use by a human user or users. Any computer-based method that uses data in segmented form from a version of SSP#1, wherein the method essentially outputs data to a monitor or equivalent device is a version of the invention. More specific versions of SSP#1 handle essentially molecular data or molecular data objects. A description of some more specific such versions is as follows:

[0139] SMSP#1 A computer-based method of segmenting a group of two or more data objects as in SSP#1, wherein each of the data objects is a molecular data object, the response is a molecular property and the descriptor is a geometry-based descriptor.

[0140] Versions of SMSP#1 output data in segmented form to a monitor, LCD, printer or equivalent device for use by a human user or users. Any computer-based method that uses data in segmented form from a version of SMSP#1, wherein the method essentially outputs data to a monitor or equivalent device is a version of the invention.

[0141] A collection of two or more subgroups of data objects, wherein the subgroups were generated by a segmentation process that segmented a group is defined as a segmentation of the group of data objects.

[0142] Segmenting a group of data objects (such as molecular data objects) using more than one descriptor or more than one segmenting algorithm also has utility. In order to further describe these types of versions of the invention, it is helpful to examine (1) groups of data objects having more than one descriptor; and it is helpful to examine (2) segmenting algorithms in further detail as well. We start with (1) first.

[0143] (1) Groups of data objects having more than one descriptor: A very helpful (nonlimiting) way to conceptualize a group (or subgroup) of data objects, the objects' property (response) and descriptor (or predictor) values is the matrix shown in FIG. 3. The objects are denoted O₁, . . . , O_(n). The objects' property (response) is denoted as P and the objects' descriptors (or predictors) are denoted as D₁, . . . , D_(M). The particular value (quantitative or qualitative) of a particular object's property (or descriptor) is indicated in the matrix cell corresponding to the object and property (or descriptor). In cases in which it is not possible to ascertain the particular value of a descriptor, the value is indicated in the matrix as “missing”. (This conceptualization is essentially applicable to any group or subgroup of data objects.)
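
One hedged way to hold such a matrix in memory is sketched below: one row per data object, one column for the property P and one per descriptor, with a sentinel marking a value recorded as “missing”. The object names, the values, the helper functions and the use of None as the sentinel are all hypothetical and are not taken from FIG. 3.

```python
MISSING = None  # stand-in for a matrix cell recorded as "missing"

# Rows are data objects O1..On; columns are the property P and the
# descriptors D1..DM (here M = 2, one quantitative and one qualitative).
matrix = {
    "O1": {"P": 0.9, "D1": 4.0, "D2": "white"},
    "O2": {"P": 0.2, "D1": 7.0, "D2": MISSING},
    "O3": {"P": 0.8, "D1": MISSING, "D2": "yellow"},
    "O4": {"P": 0.1, "D1": 9.0, "D2": "white"},
}


def known_pairs(matrix, descriptor, prop="P"):
    """(object, descriptor value, property value) for every object whose
    value for the given descriptor is not missing."""
    return [
        (name, row[descriptor], row[prop])
        for name, row in matrix.items()
        if row[descriptor] is not MISSING
    ]


def missing_objects(matrix, descriptor):
    """Names of the objects whose value for the descriptor is missing."""
    return [name for name, row in matrix.items() if row[descriptor] is MISSING]
```

A segmenting algorithm working on descriptor D1 would then receive the pairs returned by known_pairs(matrix, "D1"), while the objects listed by missing_objects(matrix, "D1") would need whatever treatment of missing values the chosen process provides.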

[0144] (2) A further discussion of segmenting algorithms. Segmenting algorithms have been discussed above and in the references 1-9, and the DPSA Appendix. Characteristics of a segmenting algorithm often include one or more measures of homogeneity/inhomogeneity. These measures include (1) a measure of the homogeneity of property (response) values for the data objects in each segment or (2) an overall (for all the segments combined) measure of the homogeneity of the property (response) values in each segment or (3) a measure of the inter-segment property (response) value inhomogeneity (such a measure is pair-wise or for any combination of two or more segments) or (4) any combination of (1), (2) or (3). Examples of such measures of homogeneity or inhomogeneity are found in the references 1-9, and the DPSA Appendix and include statistical measures, least square type measures and deviance measures. Other characteristics of a segmenting algorithm often include a) one or more stop criteria (defined below), b) the manner in which it chooses a final segmentation from various possible segmentations, c) the number of segments it generates in segmenting, d) the manner in which it performs its calculations. This list, a) through d) is not necessarily exhaustive. It is not necessary for a segmenting algorithm to choose a best or even approximately best segmentation, although such a choice or choices have utility. “Best segmentation” means best in terms of one or more measures of homogeneity/inhomogeneity or similar measures.

[0145] A stop criterion is a criterion that tells the algorithm that there is (1) no acceptable potential segmentation or segmentations or (2) to stop seeking a potential (or candidate) segmentation or segmentations or a similar such criterion or criteria. Examples include (but are not limited to) when potential segmentations have too few objects in one or more segments or the measure of homogeneity/inhomogeneity between segments of potential segmentations is low, for example statistically insignificant. In this patent application the term “stop criterion” (or “stop criteria”) includes any stop criterion (or criteria) known to a person of ordinary skill in the art of data segmenting, or recursive partitioning or neighboring art(s).
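
The two example stop criteria named above can be sketched as a single check on a candidate segmentation. In this sketch the inter-segment inhomogeneity is measured as the between-segment share of the total sum of squares, and a fixed threshold stands in for a test of statistical significance. The function name, both default thresholds and the returned strings are assumptions of the example.

```python
def stop_reason(segments, min_objects=5, min_between_fraction=0.1):
    """Return the name of the first stop criterion met, or None.

    segments: candidate segmentation given as lists of numeric responses.
    """
    # Criterion 1: too few objects in one or more segments.
    if any(len(seg) < min_objects for seg in segments):
        return "segment too small"

    values = [v for seg in segments for v in seg]
    grand_mean = sum(values) / len(values)
    total = sum((v - grand_mean) ** 2 for v in values)
    if total == 0:
        # All responses are identical, so no split can separate them.
        return "low inter-segment inhomogeneity"

    # Criterion 2: inhomogeneity between segments is low. The fraction of
    # the total sum of squares explained by the split is used here in place
    # of a significance test (an assumption of this sketch).
    between = sum(
        len(seg) * (sum(seg) / len(seg) - grand_mean) ** 2 for seg in segments
    )
    if between / total < min_between_fraction:
        return "low inter-segment inhomogeneity"
    return None
```

A segmenting algorithm that receives a non-None result for every candidate segmentation of a node would leave that node unsplit.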

[0146] II. Some Versions of Node Segmentation Processes

[0147] This section describes segmentation processes that segment a group of data objects (such as molecular data objects) using one or more segmenting algorithms and one or more descriptors (such as one or more descriptors that are geometry-based descriptors).

[0148] A process that produces a segmentation of a group of data objects by generating one or more candidate segmentations of the group using one or more descriptors and using one or more segmenting algorithms and that elects one of the candidate segmentations as a final segmentation is referred to as a node segmentation process or as a group segmentation process. (A node or group segmentation process that uses one or more geometry-based descriptors (or predictors) to segment a group containing one or more molecular data objects is a version of the invention.) Inherent in the election of a final segmentation by a node segmentation process is the possible use of the elected segmentation for generation of one or more daughter nodes, wherein each daughter node corresponds to a segment in the elected segmentation. In effect, the daughter nodes and the (original) node constitute a nodal tree, wherein the (original) node is the parent node of each of the daughter nodes. (The terms “group of data objects” and “node of data objects” are used somewhat interchangeably in this patent application.) As the name implies, a candidate segmentation is a segmentation that could be elected by a process as a (or the) final segmentation.

[0149] Description of a version of a node segmentation process

[0150] A description of a version of a node or group segmentation process is as follows.

[0151] GenNSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups, wherein each data object has a response value and a value for each of one or more descriptors, comprising:

[0152] choosing one or more segmenting algorithms for each descriptor and generating one or more candidate segmentations for each descriptor; and

[0153] electing one of the candidate segmentations as a final segmentation and designating all of the data objects in each of one or more segments of the final segmentation as a subgroup.
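The two parts of GenNSP#1 can be illustrated with the following minimal sketch. It is nonlimiting and rests on assumptions: a simple two-way threshold split stands in for the segmenting algorithm (it is not a DPSA, FSA or Bd-DPSA), the election uses the lowest total squared error, and the data objects, descriptor names "d1" and "d2" and all values are hypothetical.

```python
from statistics import mean

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def binary_split_candidates(objects, descriptor):
    """Stand-in segmenting algorithm: every two-way threshold split on one descriptor."""
    ordered = sorted(objects, key=lambda o: o[descriptor])
    candidates = []
    for i in range(1, len(ordered)):
        if ordered[i - 1][descriptor] == ordered[i][descriptor]:
            continue  # equal descriptor values cannot be separated by a threshold
        candidates.append((descriptor, [ordered[:i], ordered[i:]]))
    return candidates

def node_segmentation_process(objects, descriptors, algorithm=binary_split_candidates):
    """Generate candidate segmentations for each descriptor, then elect one as final."""
    candidates = []
    for d in descriptors:
        candidates.extend(algorithm(objects, d))
    if not candidates:
        return None

    def total_sse(candidate):
        _, segs = candidate
        return sum(sse([o["response"] for o in seg]) for seg in segs)

    return min(candidates, key=total_sse)  # election by lowest within-segment error

# Hypothetical data objects: a response value and a value for each of two descriptors.
objects = [
    {"response": 1.0, "d1": 1, "d2": 7},
    {"response": 1.2, "d1": 2, "d2": 3},
    {"response": 9.0, "d1": 8, "d2": 5},
    {"response": 9.4, "d1": 9, "d2": 4},
]
descriptor, segments = node_segmentation_process(objects, ["d1", "d2"])
```

Each segment of the elected segmentation is then designated as a subgroup, and could become a daughter node of the segmented node.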

[0154] The term “node segmentation process” is abbreviated as NSP. As with some segmenting algorithms, a node segmentation process does not necessarily elect a best or approximately best segmentation. Some versions of segmentation processes elect a statistically meaningful segmentation. Versions of an NSP that elect a best or approximately best segmentation have definite utility and are preferred versions. Unless specifically stated otherwise, some embodiments of each version of GenNSP#1 (or NSP described herein) handle any kind of data or data objects including real-world data, such as molecular data and data objects, and geometry-based descriptors. Unless specifically stated otherwise, some embodiments of each version of GenNSP#1 (or NSP described herein) output data in segmented form to a monitor, LCD, printer or equivalent device for use by a human user or users. Any computer-based method that uses data in segmented form from a version of GenNSP#1 (or NSP described herein), wherein the method essentially outputs data to a monitor or equivalent device, is a version of the invention.

[0155] Segmenting algorithms of an NSP

[0156] It is possible for some versions of GenNSP#1 to choose one or more segmenting algorithms for each descriptor so that for each of one or more descriptor pairs, the one or more algorithms for each descriptor of each descriptor pair are different. In some cases, for technical convenience and efficiency, the same one or more segmenting algorithms are chosen for each descriptor. Examples of a segmenting algorithm of GenNSP#1 are a DPSA, an FSA and a Bd-DPSA. FSAs are fast enough that they allow human interaction with one or more NSPs.

[0157] More specific versions of GenNSP#1 handle essentially molecular data or molecular data objects. A description of some such more specific versions of GenNSP#1 is as follows: GenMNSP#1 A computer-based method of segmenting a group of two or more data objects as GenNSP#1, wherein each of the data objects is a molecular data object, the response is a (molecular) property and one or more of the descriptors is a geometry-based (molecular) descriptor.

[0158] GenMNSP#2 A computer-based method of segmenting a group of two or more data objects as GenNSP#1, wherein each of the data objects is a molecular data object, the response is a (molecular) property, wherein each data object has a value for each of two or more descriptors, and wherein one or more of the descriptors is a geometry-based (molecular) descriptor.

[0159] FSA or Bd-DPSA Capable Node Segmentation Processes

[0160] A specific kind of node segmentation process is a process wherein (1) one or more FSAs, or (2) one or more Bd-DPSAs, or a combination of both (1) and (2) is available to use to segment one or more nodes. Such a node segmentation process essentially chooses one or more segmenting algorithms from a battery (or group) of one or more segmenting algorithms, wherein one or more of the segmenting algorithms in the battery is an FSA or a Bd-DPSA. Such a node segmentation process is an FSA or Bd-DPSA Capable Node Segmentation Process. A description of a version of an FSA or Bd-DPSA Capable Node Segmentation Process is as follows.

[0161] FSAorBd-DPSACapable NSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups, wherein each data object has a response value and a value for each of one or more descriptors, comprising:

[0162] choosing one or more segmenting algorithms for each descriptor from a battery of one or more segmenting algorithms, wherein one or more of the algorithms in the battery is an FSA or a Bd-DPSA, and generating one or more candidate segmentations for each descriptor; and

[0163] electing one of the candidate segmentations as a final segmentation and designating all of the data objects in each of one or more segments of the final segmentation as a subgroup.

[0164] For some versions of FSAorBd-DPSACapable NSP#1 the algorithm battery includes one or more FSAs and one or more Bd-DPSAs.

[0165] More specific versions of FSAorBd-DPSACapable NSP#1: For some more specific versions of FSAorBd-DPSACapable NSP#1 the battery of algorithms is limited so that one or more of the algorithms in the battery is either (1) an FSA or (2) a Bd-DPSA. In (1) the node segmentation process is an FSA capable node segmentation process; in (2) the segmentation process is a Bd-DPSA capable node segmentation process. A description of some such versions of FSAorBd-DPSACapable NSP#1 is as follows.

[0166] (1) FSACapable NSP#1 A computer-based method of segmenting a group (or node) of data objects as FSAorBd-DPSACapable NSP#1, wherein one or more of the algorithms in the battery is an FSA.

[0167] (2) Bd-DPSACapable NSP#1 A computer-based method of segmenting a group (or node) of data objects as FSAorBd-DPSACapable NSP#1, wherein one or more of the algorithms in the battery is a Bd-DPSA.

[0168] More specific versions of FSAorBd-DPSACapable NSP#1 for molecular data or data objects. More specific versions of FSAorBd-DPSACapable NSP#1 handle essentially molecular data or molecular data objects. A description of some such more specific versions of FSAorBd-DPSACapable NSP#1 is as follows:

[0169] FSAorBd-DPSACapable Mol NSP#1: A computer-based method of segmenting a group (or node) of data objects as any one of the methods FSAorBd-DPSACapable NSP#1, FSACapable NSP#1, or Bd-DPSACapable NSP#1, wherein each data object is a molecular data object, wherein the response is a molecular property, wherein each descriptor is a molecular descriptor, and wherein one or more of the descriptors is a geometry-based molecular descriptor.

[0170] Special NSPs

[0171] Any NSP that uses or has available for use a special DPSA is a special NSP. Versions of NSPs that use or have available for use FSAs or Bd-DPSAs have been described. Similarly, any similar NSP that uses or has available for use one or more PTd-DPSAs is a special NSP.

[0172] Special using NSPs: An NSP that uses one or more FSAs is an FSA using NSP. An NSP that uses one or more Bd-DPSAs is a Bd-DPSA using NSP. An NSP that uses one or more PTd-DPSAs is a PTd-DPSA using NSP.

[0173] Special capable NSPs: An NSP that has one or more FSAs available for use is an FSA capable NSP. An NSP that has one or more Bd-DPSAs available for use is a Bd-DPSA capable NSP. An NSP that has one or more PTd-DPSAs available for use is a PTd-DPSA capable NSP.

[0174] Particular Special NSPs: An NSP that uses, or has available for use, one or more FSAs is an FSA special NSP. An NSP that uses, or has available for use, one or more Bd-DPSAs is a Bd-DPSA special NSP. An NSP that uses, or has available for use, one or more PTd-DPSAs is a PTd-DPSA special NSP.

[0175] Human Interaction NSPs

[0176] In addition, some versions of NSPs essentially allow human interaction in that a human operator (1) chooses one or more of the descriptors of the method (that an NSP uses to generate the one or more candidate segmentations), (2) gives a command for an NSP to select descriptors (for use in segmenting); for some versions of NSPs, the selection is a random selection of descriptors, (3) elects one of the candidate segmentations as a final segmentation, or (4) chooses one or more of the segmenting algorithms used by the method or (5) a combination of two or more of (1), (2) or (3) of this paragraph.

[0177] A description of some examples of such versions of the invention is as follows. Each of these versions is an example of a Human Interaction NSP (abbreviated HI-NSP). Any NSP that includes human interaction or is similar to one of the HI-NSPs recited herein (HI#1-GenNSP#1, HI#2-GenNSP#1, HI#3-GenNSP#1, HI#4-GenNSP#1, HI#1-GenNSP#2, or RandHI#1-GenNSP#2) is an HI-NSP.

[0178] HI#1-GenNSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups as any one of the methods GenNSP#1, GenMNSP#1, GenMNSP#2, FSAorBd-DPSACapable NSP#1, FSACapable NSP#1, Bd-DPSACapable NSP#1, or FSAorBd-DPSACapable Mol NSP#1, wherein one or more of the descriptors is chosen by a human operator.

[0179] HI#2-GenNSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups as any one of the methods GenNSP#1, GenMNSP#1, GenMNSP#2, FSAorBd-DPSACapable NSP#1, FSACapable NSP#1, Bd-DPSACapable NSP#1, or FSAorBd-DPSACapable Mol NSP#1, wherein the electing of the final segmentation uses one or more commands from a human operator.

[0180] HI#3-GenNSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups as any one of the methods GenNSP#1, GenMNSP#1, GenMNSP#2, FSAorBd-DPSACapable NSP#1, FSACapable NSP#1, Bd-DPSACapable NSP#1, or FSAorBd-DPSACapable Mol NSP#1, wherein a human operator selects a particular candidate segmentation, and wherein the electing of the particular candidate by the method as the final segmentation uses one or more commands from the operator.

[0181] HI#4-GenNSP#1 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups as any one of the methods GenNSP#1, GenMNSP#1, GenMNSP#2, FSAorBd-DPSACapable NSP#1, FSACapable NSP#1, Bd-DPSACapable NSP#1, or FSAorBd-DPSACapable Mol NSP#1, wherein the choosing of one or more segmenting algorithms for each of one or more descriptors by the method uses one or more commands from a human operator.

[0182] HI#1-GenNSP#2 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups, wherein each data object has a response value and a value for each of one or more descriptors, comprising:

[0183] receiving one or more commands from a human user to select one or more of the descriptors, and selecting a subset of the descriptors;

[0184] choosing one or more segmenting algorithms for each descriptor in the subset and generating one or more candidate segmentations for each descriptor in the subset; and

[0185] electing one of the candidate segmentations as a final segmentation and designating all of the data objects in each of one or more segments of the final segmentation as a subgroup.

[0186] RandHI#1-GenNSP#2 A computer-based method of segmenting a group (or node) of two or more data objects into two or more subgroups as the method HI#1-GenNSP#2, wherein the subset is a randomly selected subset of the descriptors.
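The random selection of a descriptor subset in RandHI#1-GenNSP#2 can be sketched as follows. The function name, the subset size k and the optional seed are assumptions for illustration; the text above does not specify how the size of the subset is chosen.

```python
import random

def select_descriptor_subset(descriptors, k, seed=None):
    """Return k distinct descriptors chosen at random, e.g. on a command from a human user."""
    rng = random.Random(seed)        # a seed makes the selection repeatable
    return rng.sample(list(descriptors), k)

# Hypothetical descriptor names.
all_descriptors = ["D1", "D2", "D3", "D4", "D5"]
subset = select_descriptor_subset(all_descriptors, k=3, seed=0)
```

Candidate segmentations are then generated only for the descriptors in the subset, which reduces the work per node when the number of descriptors is large.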

[0187] Stop Criteria and NSPs

[0188] As described above, some segmenting algorithms use one or more stop criteria to stop segmenting. Versions of NSPs choose one or more segmenting algorithms to achieve a final segmentation (or split) of a node. Each of one or more versions of an NSP chooses one or more segmenting algorithms to segment a node, wherein each of the one or more algorithms uses one or more stop criteria. Thus it is possible for a node to meet one or more stop criteria of one or more segmenting algorithms chosen by each of one or more NSPs.

[0189] III. Some Versions of Processes that Generate Nodal Trees.

[0190] Use of one or more node segmentation processes to generate a nodal tree. By using such a node segmentation process on an initial group (or root node) of molecular data objects and applying one or more such node segmentation processes recursively (wherein only one process is used on one node) to zero or more descendant nodes, a nodal tree is generated. Such a nodal tree is similar to nodal trees discussed earlier in this application.

[0191] Description of a Version of a Segmenting Nodal Tree Generation Process

[0192] GenSNTGP#1 A computer-based method for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of data objects, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising:

[0193] defining a nodal tree-node segmenting procedure, comprising i), ii), iii), iv):

[0194] i) choosing an unsegmented node that has not been previously segmented;

[0195] ii) choosing a node segmentation process for the unsegmented node;

[0196] iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and

[0197] iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv);

[0198] applying the nodal tree-node segmenting procedure to the root node first; and applying the nodal tree-node segmenting procedure recursively to zero or more unsegmented nodes of the tree.

[0199] Description of some versions of a segmenting nodal tree generation process that utilize one or more stop criteria.

[0200] In addition, it is possible to practice a process such as GenSNTGP#1 until one or more stop criteria are met for one or more nodes. The nature of a stop criterion or criteria was previously discussed. An example of such a version of the invention is described next and illustrated in FIG. 4.

[0201] Description of a version of a segmenting nodal tree generation process that uses one or more stop criteria

[0202] GenSNTGP#2 A computer-based method for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of data objects, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising:

[0203] defining a nodal tree-node segmenting procedure, comprising i), ii), iii), iv):

[0204] i) choosing an unsegmented node that has not been previously segmented;

[0205] ii) choosing a node segmentation process for the unsegmented node;

[0206] iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and

[0207] iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv);

[0208] applying the nodal tree-node segmenting procedure to the root node first; and applying the nodal tree-node segmenting procedure recursively to one or more unsegmented nodes of the tree until one or more stop criteria are met for each of one or more unsegmented nodes.
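The procedure i) through iv), applied recursively until stop criteria are met, can be sketched as follows. This is a nonlimiting illustration under assumptions: each data object is a hypothetical (descriptor value, response) pair with a single descriptor, a best two-way threshold split stands in for the chosen node segmentation process, and the stop criteria are a minimum segment size and a minimum relative reduction in squared error.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    objects: list                                  # (descriptor value, response) pairs
    children: list = field(default_factory=list)   # empty for an unsegmented node

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_binary_split(objects, min_objects):
    """Stand-in NSP (assumes min_objects >= 1): best two-way split, or None if none qualifies."""
    ordered = sorted(objects)
    best, best_score = None, None
    for i in range(min_objects, len(ordered) - min_objects + 1):
        if ordered[i - 1][0] == ordered[i][0]:
            continue  # equal descriptor values cannot be separated
        left, right = ordered[:i], ordered[i:]
        score = sse([r for _, r in left]) + sse([r for _, r in right])
        if best_score is None or score < best_score:
            best, best_score = [left, right], score
    return best

def grow_tree(node, min_objects=2, min_relative_gain=0.5):
    """Segment `node`, then recurse into each daughter node until a stop criterion is met."""
    parent_sse = sse([r for _, r in node.objects])
    segments = best_binary_split(node.objects, min_objects)        # steps ii) and iii)
    if segments is None or parent_sse == 0:
        return node                                                # stop: no acceptable split
    child_sse = sum(sse([r for _, r in s]) for s in segments)
    if (parent_sse - child_sse) / parent_sse < min_relative_gain:
        return node                                                # stop: gain too low
    node.children = [grow_tree(Node(s), min_objects, min_relative_gain)
                     for s in segments]                            # step iv)
    return node

# Hypothetical root node of six data objects.
example = [(1, 10.0), (2, 11.0), (3, 10.5), (7, 50.0), (8, 52.0), (9, 51.0)]
root = grow_tree(Node(example))
```

In this example the root is split once into two daughter nodes of three objects each, and both daughters remain unsegmented because no split of a three-object node satisfies the minimum segment size.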

[0209] Further description of each of the methods GenSNTGP#1 and GenSNTGP#2. (In the above descriptions of GenSNTGP#1 and GenSNTGP#2, the indices i), ii), iii), and iv) are included only for the purpose of clarity. The indices i), ii), iii), and iv) are nonlimiting and do not necessarily limit the method to a step method, or to a specific step or steps, or to a specific number or order of steps.) Unless specifically stated otherwise, some embodiments of each version of GenSNTGP#1 and GenSNTGP#2 (or any process that generates a nodal tree described herein) handle any kind of data or data objects including real-world data, such as molecular data and data objects, and geometry-based descriptors. Unless specifically stated otherwise, some embodiments of each version of GenSNTGP#1 and GenSNTGP#2 (or any process that generates a nodal tree described herein) output data in segmented form to a monitor, LCD, CRT, printer or equivalent device for use by a human user or users. Any computer-based method that uses data in segmented form from a version of GenSNTGP#1 (or any process that generates a nodal tree described herein), wherein the method essentially outputs data to a monitor or equivalent device, is a version of the invention.

[0210] Node Segmentation Process or Processes chosen by the nodal tree-node segmenting procedure of each of the methods GenSNTGP#1 and GenSNTGP#2. GenSNTGP#1 uses a nodal tree-node segmenting procedure. (The abbreviation of nodal tree-node segmenting procedure is NT-NSPrcdr.) In GenSNTGP#1, a node segmentation process (an NSP) is chosen by the NT-NSPrcdr in ii) one or more times. Specifically, an NSP is chosen for each unsegmented node that is split (or segmented) by the NT-NSPrcdr. For some versions of GenSNTGP#1, two NSPs chosen for each of one or more pairs of unsegmented nodes^(XIII) are two differing NSPs. So it is possible for a version of GenSNTGP#1 to essentially choose several differing NSPs. In some situations, for purposes of technical convenience, essentially the same NSP is chosen for each unsegmented node that is split by GenSNTGP#1. The description and details of this paragraph with respect to NSPs are also true of GenSNTGP#2. It is possible for each version of GenSNTGP#1 and GenSNTGP#2 to choose each version of an NSP described in this document one or more times. Versions of NSPs described herein include special NSPs and HI-NSPs. An NT-NSPrcdr, wherein the procedure chooses one or more special NSPs, is a special NT-NSPrcdr. Such an NT-NSPrcdr uses one or more special NSPs. A special NT-NSPrcdr uses, or has available for use, one or more special NSPs. And therefore, a special NT-NSPrcdr effectively uses^(XIV) one or more special DPSAs.

^(XIII) An NSP is chosen for each (pair) of two unsegmented nodes that are acted on by the NT-NSPrcdr, for a total of two chosen NSPs (possibly different) for each pair of nodes.

^(XIV) The term “effectively uses” means that a special DPSA is effectively used or effectively available for use by the procedure. Such effective use or effective available use by the procedure is through the one or more special NSPs that the procedure uses or through the one or more special NSPs that are available for use by the procedure.

[0211] Using NT-NSPrcdrs

[0212] A special NT-NSPrcdr wherein the procedure chooses one or more special NSPs is a special-using NT-NSPrcdr. Some such versions of special-using NT-NSPrcdrs have been described above. A special-using NT-NSPrcdr that chooses one or more FSA special NSPs is an FSA-using NT-NSPrcdr. A special-using NT-NSPrcdr that chooses one or more Bd-DPSA special NSPs is a Bd-DPSA-using NT-NSPrcdr. A special-using NT-NSPrcdr that chooses one or more PTd-DPSA special NSPs is a PTd-DPSA-using NT-NSPrcdr.

[0213] Able NT-NSPrcdrs

[0214] A nodal tree-node segmenting procedure wherein one or more special NSPs is available for use by the procedure is a special-able NT-NSPrcdr. (A special-able NT-NSPrcdr is also a special NT-NSPrcdr.) Such a special-able NT-NSPrcdr essentially chooses one or more special NSPs from an ensemble (or group) of one or more NSPs, wherein one or more of the NSPs in the ensemble is a special NSP. Such a special-able NT-NSPrcdr essentially has available an ensemble that includes one or more special NSPs. A special-able NT-NSPrcdr that has available an ensemble with one or more FSA special NSPs is an FSA-able NT-NSPrcdr. A special-able NT-NSPrcdr that has available an ensemble with one or more Bd-DPSA special NSPs is a Bd-DPSA-able NT-NSPrcdr. A special-able NT-NSPrcdr that has available an ensemble with one or more PTd-DPSA special NSPs is a PTd-DPSA-able NT-NSPrcdr.

[0215] Particular special NT-NSPrcdrs. As is clear from the above description, a special NT-NSPrcdr effectively uses one or more special DPSAs. A special NT-NSPrcdr wherein the procedure is an FSA-able NT-NSPrcdr or an FSA-using NT-NSPrcdr is an FSA-special NT-NSPrcdr. An FSA-special NT-NSPrcdr effectively uses one or more FSAs.

[0216] A special NT-NSPrcdr wherein the procedure is a Bd-DPSA-able NT-NSPrcdr or a Bd-DPSA-using NT-NSPrcdr is a Bd-DPSA-special NT-NSPrcdr. A Bd-DPSA-special NT-NSPrcdr effectively uses one or more Bd-DPSAs. A special NT-NSPrcdr wherein the procedure is a PTd-DPSA-able NT-NSPrcdr or a PTd-DPSA-using NT-NSPrcdr is a PTd-DPSA-special NT-NSPrcdr. A PTd-DPSA-special NT-NSPrcdr effectively uses one or more PTd-DPSAs.

[0217] A more formal description of (1) a special-able NT-NSPrcdr and (2) a nodal tree generating (or growing) process that uses the special-able NT-NSPrcdr

[0218] A process similar to GenSNTGP#1 that uses a special-able NT-NSPrcdr, specifically an FSA-able NT-NSPrcdr, is described below.

[0219] FSA-able GenSNTGP#1 A computer-based method for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of data objects, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising:

[0220] defining a nodal tree-node segmenting procedure, comprising i), ii), iii), iv):

[0221] i) choosing an unsegmented node that has not been previously segmented;

[0222] ii) choosing a node segmentation process for the unsegmented node from an ensemble of one or more NSPs, wherein one or more of the ensemble NSPs is an FSA special NSP;

[0223] iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and

[0224] iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv);

[0225] applying the nodal tree-node segmenting procedure to the root node first; and applying the nodal tree-node segmenting procedure recursively to zero or more unsegmented nodes of the tree.^(XV)

[0226] It is also possible to practice a method such as FSA-able GenSNTGP#1 above with an additional data gathering step or step-like part. For example: A method such as FSA-able GenSNTGP#1, wherein one or more of the data objects is a real-world object, further comprising: collecting one or more descriptor values or one or more property values of each of one or more of the real-world objects by physical measurement or observation.

[0227] Human Interaction in some versions of each of GenSNTGP#1 and GenSNTGP#2

[0228] An important feature of some versions of each of the methods GenSNTGP#1 and GenSNTGP#2 is that human interaction/intervention is a part of growing the nodal tree. One way that human interaction is part of versions of these tree growing processes is through one or more Human Interaction NSPs chosen in ii) by the processes.

[0229] For some versions of each of GenSNTGP#1 and GenSNTGP#2, the nodal tree grown by each method is a subtree of a larger (previously generated) nodal tree, and the root node of the grown nodal tree (grown by each method) is a daughter node of the larger tree. This is a situation for which human interaction in growing a nodal tree by versions of each method (GenSNTGP#1 and GenSNTGP#2) is important.

[0230] A more formal description of versions of tree generating (or growing) processes that include human interaction is as follows.

[0231] HIGenSNTGP#1 A computer-based method for clarifying a relationship between a response and one or more descriptors by generating a data structure, as in any one of the methods GenSNTGP#1 or GenSNTGP#2, wherein the nodal tree-node segmenting procedure chooses one or more Human Interaction NSPs, one or more times.

[0232] Versions of each of GenSNTGP#1 and GenSNTGP#2 handle essentially any kind of data or data objects, including real-world data or real-world data objects. Some more specific versions of each of GenSNTGP#1 and GenSNTGP#2 handle molecular data or molecular data objects.

[0233] A description of some such versions of the invention is as follows.

[0234] MolSNTGP#1 A computer-based method for clarifying a relationship between a response and one or more descriptors as any one of the methods GenSNTGP#1 or GenSNTGP#2, wherein each of one or more data objects is a molecular data object, wherein each of one or more of the descriptors is a geometry-based molecular descriptor, and wherein the response is a molecular property.

[0235] An example of MolSNTGP#1 is a method as MolSNTGP#1 wherein each of the data objects is a molecular data object.

EXAMPLE 1

[0236] An illustrative example of a version of the invention makes use of a novel type of descriptor (or predictor) to describe chemical compounds that are used as drugs. In this version of the invention, the descriptors (or predictors) comprise atom class pairs and the shortest (through compound) path length between the two focal atoms of the atom class pair. An example of such atom class pairs and a compound is shown in FIG. 1.

[0237] FIG. 1 shows a compound that illustrates two quantitative descriptors or predictors (with respective quantitative values) that use atom class pairs. In FIG. 1, a first quantitative descriptor and value is denoted “OC—8—CCC” and a second quantitative descriptor and value is denoted “CCCCC—2—CC”. The first descriptor consists of a first atom class denoted “OC” and a second atom class denoted “CCC”. The first letter in the denotation is the focal atom of the atom class and the following letters on the list represent atoms attached to the focal atom of the atom class. Thus “O” is the focal atom of the first class, and this “O” represents the Oxygen circled in FIG. 1; and the “C” of “OC” represents the single Carbon attached to the circled Oxygen. The first “C” of the second atom class, denoted “CCC”, is the circled aromatic Carbon and the following “CC” represents the two aromatic Carbons attached to the circled aromatic Carbon. The number “8” is the number of bonds in the shortest path through the molecule between the focal Carbon and the focal Oxygen of the atom class pair. (Attached hydrogens are considered in some of the descriptors, but not in this particular example.)

[0238] Thus, the first descriptor is partly denoted “OC—b—CCC”, wherein “OC” means an oxygen attached to only one carbon, “CCC” means a carbon (focal carbon) attached to only two carbons and b is the number of bonds in the shortest path through the molecule between the focal carbon and the oxygen. For this particular compound (data object), “b” is 8.

[0239] However, this (partial) denotation of the first descriptor is not unique for the compound: there are three atom classes in the compound that are described by “OC” and there are six atom classes that are described by “CCC”. And there are nine possible atom class pairs with differing “b” values that are described by “OC—b—CCC”. In order to make the value of the first descriptor unique, the pair (or pairs) with the largest “b” value is specified to be the descriptor. Thus the full denotation of the first descriptor is “PLHI: OC—b—CCC”, wherein PLHI stands for “Path Length High” and stipulates the pair or pairs with the largest “b” value. The criterion of selecting the largest “b” value, the PLHI, is known as a selecting criterion for the descriptor.

[0240] Thus the first descriptor is a quantitative descriptor and the value of the descriptor is “b”. The value “b” is a unique quantitative value for any one compound when the atom class pair of the first descriptor is present in the compound. When the atom class pair of the first descriptor is not present in a compound, the value of the descriptor is “not applicable” or “missing”. (“Not applicable” and “missing” are used interchangeably in this application, although making a distinction has utility in some cases.)

[0241] Similarly the second descriptor value denoted “CCCCC—2—CC” in FIG. 1 represents the circled ring Carbon (focal atom) attached to four carbons that is 2 bonds away from the terminal circled Carbon (focal atom) attached to one Carbon. The number “2” is the number of bonds in the shortest path through the molecule between the two focal carbons.

[0242] The second descriptor is partly denoted “CCCCC—b—CC”, wherein “CCCCC” means a Carbon (focal carbon) attached to only four carbons, “CC” means a carbon (focal carbon) attached to only one carbon and b is the number of bonds in the shortest path through the molecule between the two focal carbons. For this particular compound (data object), “b” is 2.

[0243] Similarly the full denotation of the second descriptor is “PLLO: CCCCC—b—CC”, wherein “PLLO” stands for “Path Length Low”. The second descriptor is a quantitative descriptor and the value of the descriptor is “b” or “missing”; the value of the second descriptor is unique for any one compound (data object).
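The computation of a PLHI or PLLO descriptor value can be sketched as follows. This is a nonlimiting illustration under assumptions: the molecule is a small hypothetical chain of heavy atoms (not the compound of FIG. 1), hydrogens are ignored, an atom class is written as the focal element followed by its attached elements in sorted order, and a plain breadth-first search stands in for the graph matching and graph traversing algorithms.

```python
from collections import deque

def atom_class(atom, elements, bonds):
    """Focal element followed by the sorted elements of the atoms attached to it."""
    return elements[atom] + "".join(sorted(elements[n] for n in bonds[atom]))

def bond_distances(start, bonds):
    """Breadth-first search: number of bonds on the shortest path from `start` to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for n in bonds[a]:
            if n not in dist:
                dist[n] = dist[a] + 1
                queue.append(n)
    return dist

def path_length_descriptor(class_a, class_b, elements, bonds, select=max):
    """PLHI when select=max, PLLO when select=min; None stands for 'missing'."""
    focal_a = [a for a in elements if atom_class(a, elements, bonds) == class_a]
    focal_b = [a for a in elements if atom_class(a, elements, bonds) == class_b]
    lengths = []
    for a in focal_a:
        dist = bond_distances(a, bonds)
        lengths.extend(dist[b] for b in focal_b if b != a and b in dist)
    return select(lengths) if lengths else None

# Hypothetical molecule: the heavy-atom chain O-C-C-C-C-O.
elements = {0: "O", 1: "C", 2: "C", 3: "C", 4: "C", 5: "O"}
bonds = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

plhi = path_length_descriptor("OC", "CCC", elements, bonds, select=max)
pllo = path_length_descriptor("OC", "CCC", elements, bonds, select=min)
absent = path_length_descriptor("OC", "CNNO", elements, bonds)
```

In this chain each oxygen is of class "OC" and the two middle carbons are of class "CCC", giving path lengths of 2 and 3 bonds; the selecting criterion then picks 3 for PLHI and 2 for PLLO, and a pair that does not occur in the molecule yields a missing value.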

[0244] The following is an example of the use of such quantitative descriptors (or predictors) to analyze chemical compounds such as drugs. Quantitative descriptors (or predictors) similar to the two descriptors described above are applied to a group of 159 chemical compounds. A descriptor (or predictor) value is obtained for each descriptor (or predictor) for each compound in the group. A property (or response) value is obtained for each compound in the group. In this particular case, the property (or response) is drug potency.

[0245] In FIG. 2 a root node with 159 drugs (n=159) of average potency 161 (u=161) has been split into three daughter nodes. The particular descriptor used to make the segmentation is PLHI: CC—b—CNNO. The drugs in each of the three daughter nodes of the root have a similar potency and descriptor values. The leftmost daughter node has 5 drugs (n=5) each drug of potency approximately 684 (u=684) and descriptor b values <=5, the middle daughter node has 34 drugs (n=34) each drug of potency approximately 271 (u=271) and descriptor b values 5<b<=6, and the rightmost daughter node has 120 drugs (n=120) each drug of potency approximately 107 (u=107) and descriptor b values >6.
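The split described for FIG. 2 can be restated as a small routing rule together with a consistency check on the reported node sizes and potencies. The sketch below is illustrative only; the function name and node labels are assumptions, and compounds with a missing descriptor value are not handled.

```python
def daughter_node(b):
    """Route a compound to a daughter node by its descriptor value b (thresholds from the text)."""
    if b <= 5:
        return "left"
    if b <= 6:
        return "middle"
    return "right"

# (n, u) for each daughter node as reported in the text.
nodes = {"left": (5, 684), "middle": (34, 271), "right": (120, 107)}

total_n = sum(n for n, _ in nodes.values())
pooled_mean = sum(n * u for n, u in nodes.values()) / total_n
```

The daughter node sizes sum to the 159 drugs of the root node, and the size-weighted average of the reported daughter potencies is about 160.2, which is close to the reported root average of 161; a small gap of this size is what rounding of the per-node averages would produce.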

[0246] This nodal tree is generated by using segmentation and recursivepartitoning techniques, such as S/RP P techniques described in theDescription. (For some versions of the invention the segmentationprocess used on different parent nodes uses essentially differentsegmenting algorithms.) This nodal tree essentially sorts drugs intonodes so that each lower level node (close to the terminal nodes orleaves of the tree) has drugs with common atom class pairs that areseparated by similar path lengths and have similar potency values.Generally there is greater sorting, and greater potency and descriptorvalue similarity (homogeneity) of objects in lower level nodes of thetree. And generally there is less sorting, and less potency anddescriptor value similarity (homogeneity) of objects in higher levelnodes of the tree near the root of the tree.

[0247] Some lower level high potency nodes of the tree essentially contain drugs with high potency and similar descriptor values. It is possible to conceptualize these descriptors (or predictors) and similar values as essentially clarifying the qualities of one or more effective pharmacophores that are correlated to the high potency of the drugs. Each of these one or more effective pharmacophores essentially includes the atom class pairs of some of the descriptors (or predictors) of the high potency nodes, the focal atoms of the pairs being separated by the respective quantitative values of the descriptors. The detailed path by which drug molecules are “split out” or segmented may be used to provide information on characteristics of molecular structures that are associated with higher drug potency.

[0248] Alternatively, these lower level high potency nodes of the tree elucidate predictive, correlation or causal type relationships between one or more of the similar descriptor (or predictor) values and drug potency.

[0249] In addition, the path (through the nodal tree) to a node whose drugs have lower potency may be used to provide information on characteristics of molecular structures that have no bearing on, or that actively inhibit, drug potency.

[0250] A Conceptual Device to Aid in Further Understanding the Invention

[0251] Each node of the tree in this example 1 represents a group of compounds. A very helpful (nonlimiting) way to conceptualize a group (or node) of compounds, the compounds' property (or response) and descriptor (or predictor) values is the matrix shown in FIG. 3. The compounds (or data objects) are denoted O₁, . . . , O_(n). The compound property (response) is denoted as P and the compound descriptors (or predictors) are denoted as D₁, . . . , D_(M). In this example the property (response) is drug potency, the descriptors (or predictors) are atom class pairs and distance, and generally each descriptor (or predictor) value is a distance.

[0252] Standard graph matching algorithms are used by versions of the invention to find all instances of an atom class in a compound (see Ullman, JR. An algorithm for subgraph isomorphism. J. Assoc. Comput. Mach. 23: 31-42 (1976)). This paper is incorporated herein by reference. Standard graph traversing algorithms known in the art or neighboring arts are used by versions of the invention to compute path distances.

[0253] Alternative descriptors (or predictors) for use in other embodiments of the invention. In example 1 above, the molecular features used were atom class pairs, and the focal atoms of each atom class in the pair were the distance measurement endpoints. Alternate embodiments of the invention use one or more other types of descriptor, including one or more of the descriptors (or predictors) detailed under descriptors, molecular descriptors and geometry-based molecular descriptors in the Description section of the application.

[0254] Other illustrative uses of geometry-based molecular descriptors by versions of the invention are shown and described in the Golden Helix sales brochure: “Are you still taking a ‘brute force’ approach to High Throughput Screening?”. This brochure was included with U.S. Provisional application No. 60/225113 and is incorporated herein by reference to the fullest extent of the law.

[0255] The brochure presents an Example application of versions of the invention on pages 11 through 15. This example application uses compounds from the NIH Developmental Therapeutics Program AIDS Antiviral Screen. The example application generates a nodal tree (Brochure pages 12 and 13) that elucidates relationships between drug potency and molecular characteristics of the drugs. In this example application, distances between atom class pairs (PLLO, PLHI) of drugs (molecular characteristics) are used to group or “segment” drugs into nodes. Drugs within nodes of the tree have various degrees of “likeness” or homogeneity in terms of drug potency and distances between one or more atom class pairs. In this example application, high potency nodes (u≧6) are highlighted and one such node is circled.

[0256] Each node gives a descriptor (atom class pair, distance) used in the segmentation that created the node. Nodes also contain various numbers such as “n” (the number of drugs in the node), “u” (the average potency of drugs in the node), “s” (standard deviation), “rP” (raw (unadjusted) p-value for the segmentation) and “aP” (Bonferroni adjusted p-value for the segmentation).

[0257] This Example application illustrates the capability of versions of the invention to use a “training set” of compounds (drugs with known potencies) to generate a first nodal tree that correlates drug potency with drug descriptor (or predictor) values (distances between atom class pairs). A nodal tree generated from such a training set is then used to predict the potencies of other drugs based solely on the other drugs' descriptor values (distances between atom class pairs). It is possible to use such a predictive capability to greatly increase the yield or “hit rate” in high-throughput screening (HTS).

[0258] In addition, versions of the invention generate a second nodal tree using a “validation set” of compounds. The compounds in the validation set were not present in the original training set, but also have known potencies. By confirming that the first and second nodal trees are essentially the same (from a statistical standpoint), the first (training set) nodal tree is statistically validated. Such a validation procedure tends to confirm the statistical reliability of drug potency predictions made using the original training set tree.

[0259] This sales brochure also illustrates the capability of a version of the invention to display actual relevant molecular structures (see screenshot, p. 12 and 13 of the brochure). Versions of the invention allow a user to click on a node and visualize the compounds therein. Versions of the invention highlight the structural features that lead to compound potencies in a node. Versions of the invention display other types of molecular structure representations. A similar display of any molecular structure representation is a version of the invention. (These examples are of course nonlimiting. Other versions of the invention use one or more molecules or compounds that are not drugs, and use one or more properties that are not drug potency.)

[0260] Versions of the invention use human interaction/intervention as a part of growing one or more nodal trees. Such user interaction/intervention includes the selection of compounds for study, selection of molecular properties for study, selection of molecular descriptors (or predictors), and selection of one or more stop criteria to terminate tree growth.

[0261] Versions of the invention also make use of molecular descriptors (or predictors) that are not geometry-based molecular descriptors in combination with one or more geometry-based molecular descriptors. Such versions of the invention group molecular data objects into nodes (or groups) that are similar in terms of both geometry-based and non-geometry-based molecular descriptors (or predictors). Molecular descriptors (or predictors) and other types of descriptors (or predictors) are given in published PCT application (1) PCT/US98/07899, as well as published papers (2) Hawkins, et al., Analysis of Large Structure-Activity Data Set Using Recursive Partitioning, Quant. Struct.-Act. Relat. 16, 296-302 (1997), (3) Rusinko, et al., Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning, J. Chem. Inf. Comput. Sci. 1999, 39, 1017-1026, and the two sales brochures. References (1), (2) and (3) of the preceding sentence are incorporated herein by reference to the fullest extent of the law.

[0262] Other examples of real-world data or real-world data objects that are handled by versions of the invention

[0263] Versions of the invention handle other kinds of real-world data. Examples: (1) In oil exploration: well logs. The measures (descriptor(s) or response(s)) are physical and electrical properties of the rock measured at increasing depths. A segmentation into sections of comparable physical and electrical properties yields estimates of the subsurface stratigraphy. Hawkins and Merriam, Mathematical Geology, 1974 (which is incorporated herein by reference, see ref. 5 in endnotes). (2) In mining or geology: transects across fields. The measurements (descriptor(s) or response(s)) are soil composition. A segmentation gives rise to maps showing different types of soil. (Webster, Mathematical Geology early 1970s and see ref. 1 in endnotes which are incorporated herein by reference.) (3) Market segmentation research: The measures (descriptors/predictors) are demographic and the dependent (response) is the propensity to take a particular action, for example to purchase a boutique coffee. Fitting the recursive partitioning model will then lead to identification of market segments, along with the size and demographic characteristics of each segment. Any marketing use that is similar and known in the field of marketing is also a version of the invention. (4) Credit card scoring: The dependent variable is a borrower's history of responsible use of credit. The explanatory variables are demographic and financial characteristics of the borrower. The object is to find valid credit scores. Any credit use that is similar and known in the field of credit is also a version of the invention. (5) Demographic tax studies: The dependent variable is a measure of tax compliance. The predictors are characteristics of the tax form. The purpose is identification of forms likely to be non-compliant. (A student of Prof. Hawkins did an MS thesis research project on this topic. An official copy of the thesis is with the Univ. Minnesota library and is incorporated herein by reference. The student is David McKenzie, who graduated in 1993, and his thesis applied FIRM to Minnesota Department of Revenue tax returns.) Any tax use that is similar and known in the field of revenue is also a version of the invention. Other example applications are in U.S. Pat. Nos. 4,719,571; 5,787,274; 6,182,058; 6,434,542, U.S. patent publication T998008, and the book Recursive Partitioning in the Health Sciences by Zhang and Singer, 1999 Springer-Verlag. Each of these is incorporated herein by reference to the fullest extent of the law.

[0264] Hardware

[0265] For the present invention described in this application, versions of the invention and computer-based methods described herein are not limited as to the type of computer on which they run. The computer typically includes a keyboard, a display device such as a monitor, and a pointing device such as a mouse. The computer also typically comprises a random access memory (RAM), a read only memory (ROM), a central processing unit (CPU) and a storage device such as a hard disk drive, floppy disk drive or CD-ROM. It is possible for the computer to comprise a combination of one or more of these components as long as such combination is operable and not mutually exclusive. For example, multiple processors are possible; or a device that functions in place of a keyboard is possible; or the keyboard is eliminated in some versions; or one or more peripheral storage devices such as a floppy drive or CD-ROM are eliminated. Another way to describe components of such a typical computer is processor means (or component), memory means (or component), display means (or component), pointing means (or component), peripheral memory means (or component). An Input/Output means (or component) is part of some versions of such a computer. However, as stated above, such a typical computer is described only as an example. And these examples are not limiting. And in general, versions of the invention run on any general purpose digital computer. Versions of the invention run on platforms such as Windows NT/95/98, Linux, and UNIX variants.

[0266] A Note on Data Handled

[0267] Versions of the invention handle data that is partly real-world and partly simulated, e.g., (1) one or more data objects real-world, one or more data objects abstract; (2) one or more descriptor values or one or more property values simulated and one or more descriptor values or one or more property values real. Other similar combinations are possible.

[0268] Genetics/pharmacogenomics

[0269] Also included with U.S. provisional patent application No. 60/225113, which is a priority document for this application, is the sales brochure: “Is your company taking advantage of the revolution in pharmacogenomics?” by Golden Helix, Inc. This brochure illustrates and describes one or more versions of the invention that use segmenting algorithms or recursive partitioning (or both) in the field of genetics or pharmacogenomics. This brochure is incorporated herein by reference to the fullest extent of the law. In such a genetics or pharmacogenomics context, an example of a data object is an individual creature, such as a human being. Another example of a data object is tissue from a creature. In such a context an example of a property is a phenotypic characteristic of a creature. A description (or descriptor) of a genetic makeup (of a creature) includes, but is not necessarily limited to, (1) a combination of one or more genotypes at one or more polymorphisms, (2) a combination of one or more alleles at one or more polymorphisms, (3) a combination of one or more haplotypes, and (4) a combination of two or more of (1), (2), or (3). A phenotypic characteristic includes (but is not limited to) positive or negative drug response. A phenotypic characteristic is an observable or inferable inherited genetic characteristic or inherited genetic trait including a biochemical or biophysical genetic trait; for example, an inherited disease is a genetic characteristic, and a predisposition to an inherited disease is a genetic characteristic. A phenotypic characteristic, phenotypic property or character is a genetic characteristic. The distinction between a phenotypic characteristic and a genetic descriptor is somewhat arbitrary. The above terms (such as descriptor, property, data object, creature, phenotype, genetic make-up, tissue) used to describe versions of the invention include any similar or equivalent term known to those of ordinary skill in genetics or pharmacogenomics. Such terms include any term which is essentially a descriptor, property, creature, data object, or tissue in any phase of the pharmaceutical industry.

[0270] A biological property or characteristic, or an observable or inferable characteristic including a biochemical or biophysical characteristic, is used by versions of the invention to characterize (using a descriptor or property value) a creature or tissue from a creature.

[0271] Unless specifically stated otherwise, some embodiments of each version of any process or apparatus that segments data described herein output data in segmented form (1) to a monitor, LCD, CRT, printer or equivalent device for use by a human user or users or (2) to a memory device such as a hard drive or (3) for sending over media such as the Internet. Any computer-based method (or apparatus) that uses data in segmented form from a version of the invention (or equivalent invention) described herein, wherein the method essentially outputs data to a monitor or equivalent device, is a version of the invention. Any apparatus that practices any process described herein that is a version of the invention is also a version of the invention.

[0272] Any data structure described herein or generated by any version of the invention (either during its operation or essentially as an end result) described herein is a version of the invention. A data structure or other version of the invention described herein that is on a computer readable medium such as a CD-ROM, flash ROM, RAM, hard drive or embedded in a computer readable transmission signal (i.e., electromagnetic or optical) is a version of the invention. The data in some data structures generated by versions of the invention, such as nodal trees, hierarchies of candidate score values, or during the calculation of best score subsets, are functionally interrelated and also essentially require processing by a computer.

[0273] Versions of the invention that are similar to versions described herein operate by sending or receiving (or both) information (including over media such as the Internet). Versions of the invention are any one process described or claimed herein, wherein the process comprises sending or receiving information in one or more steps, step-like parts or parts of the process. And any apparatus that practices such a process is a version of the invention.

[0274] Scope of the Invention

[0275] It is generally possible for any process described herein which handles real-world data to be practiced with an additional (further included) step or step-like part of data gathering or collection, such as actual physical collection. And such a process is also a version of the invention. All the features disclosed in this specification (including any claims and drawings), and one or more of the steps of any method or process so disclosed, may be used in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. Each feature disclosed in this specification (including any claims and drawings) may be replaced by alternative features of the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0276] While the description contains many specificities, these represent exemplifications of versions of the invention and do not limit the scope of the invention. Therefore the reader's attention should also be directed to the claims and their legal equivalents and to equivalent versions of the invention not specifically described.

[0277] Versions of the invention illustratively disclosed herein suitably may be practiced in the absence of any element which is not specifically disclosed herein. Versions of the present invention illustratively disclosed herein suitably may be practiced wherein one or more of the terms “comprising” is replaced by one or more of “consisting”, “consisting essentially”, or “consisting predominantly”. The references in the endnotes are incorporated herein by reference to the fullest extent of the law. ¹⁰

[0278] Technical Field: Versions of the invention have applications in many areas, including analysis of real-world data. Some versions are specifically in the area of high-throughput screening of pharmaceuticals. Some versions are applicable in pharmacogenomics. Some versions are applicable in mining, marketing studies, and other applied areas.

[0279] DPSA Appendix: DPSAs and Fast Segmenting Algorithms

[0280] Professor Douglas Hawkins has worked in the field of segmenting data using statistical and computational methods for many years. Professor Hawkins discovered an important segmenting algorithm many years ago.¹¹ The algorithm is an O(n²) dynamic programming algorithm to find the optimal cutpoints for a set of segments [see references]. This algorithm, while much faster than an exhaustive search (or computation), nevertheless can run very slowly when segmenting large quantities of data. (An algorithm that runs in time O(n²) is one where the time to solve the problem is proportional to the square of the input size.)

[0281] Embodiments of our methodology (Fast Segmenting Algorithms) run in time O(n^(1.5)), O(n log n), or even O(n). When performing segmenting on real-world data, our algorithms can make the difference between solving a problem in seconds instead of hours.

[0282] We first describe the Hawkins algorithm by way of illustration in order to teach versions of the invention (Fast Segmenting Algorithm). The basic principle behind dynamic programming is that partial solutions of a problem that have been computed earlier can be stored and used later in the computation to reduce the amount of time spent. Hawkins uses this principle in his algorithm.

[0283] Hawkins's Dynamic Programming (DP) Algorithm to Find Optimal Segmentation of a Group of Data Points

[0284] Discrete data points (or values) in a sequential order.

[0285] Let y₁, y₂, y₃, . . . , y_(n−2), y_(n−1), y_(n) be a group of n discrete data values or data points. (It is also possible to speak of these n discrete data values (or points) as a vector of data, wherein the vector has length n. And it is also possible to speak of these n discrete data values as vector y.)

[0286] “Segmenting” such a group of points into nonoverlapping “segments”.

[0287] It is possible to subgroup these n data points into k segments (k≦n), so that each of the n data points belongs to one and only one segment. This process of “segmenting” the n data points into k segments is a process of forming k disjoint subgroups of contiguous points. (These k segments are referred to herein as a k-way split or k segment covering. An alternative expression is a k-segment segmentation.)

[0288] Segmenting in such a way that the data points within each segment are homogeneous.

[0289] It is possible to segment a group of sequential data points into k segments many ways. (In particular, since a covering is fixed by choosing k−1 cutpoints among the n−1 gaps between adjacent points, there are C(n−1, k−1)=(n−1)!/[(n−k)!(k−1)!] possible coverings of n data points into k nonempty segments.) However, it is a goal of a segmenting algorithm (or segmentation process) that the points within each segment be essentially similar in value or homogeneous. Thus a segmenting algorithm essentially chooses (or prefers) only coverings for which the data points within each segment are essentially homogeneous (in value).

[0290] A measure of data point segment homogeneity: the sum of squared deviations of the data points within the segment about their mean.

[0291] To achieve the goal of homogeneity of data points within each segment of a covering, Hawkins chooses a measure of the homogeneity of the data values within each segment (for a possible covering). The measure of homogeneity used for any one segment is the sum of squared deviations of the data points within the segment about their mean. Let 1≦i≦j≦n. For a segment corresponding to points i, i+1, . . . , j−1, j, the mean of the data values within the segment is given by $\bar{y}_{i,j} = \frac{\sum_{m=i}^{j} y_{m}}{j - i + 1} \qquad \text{(Equation 1)}$

[0292] And the measure of homogeneity (the sum of squared deviations of the data points within the segment about their mean) is denoted as r(i, j). $r(i,j) = \sum_{m=i}^{j} \left( y_{m} - \bar{y}_{i,j} \right)^{2} \qquad \text{(Equation 2)}$

[0293] The measure of homogeneity, r(i, j), is a low value if the segment is homogeneous (i.e. if the data point values y_(i), y_(i+1), . . . , y_(j−1), y_(j) are similar or homogeneous). The measure r(i, j) is the score function for a segment.
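As a nonlimiting illustration, the segment score r(i, j) of Equation 2 can be evaluated in constant time per segment once cumulative sums of the data values and of their squares have been stored, using the identity that the sum of squared deviations equals the sum of squares minus the squared sum divided by the segment length. The following is a minimal sketch; the name `make_segment_score` and the sample data are assumptions made for illustration and do not appear in the references.

```python
from itertools import accumulate

def make_segment_score(y):
    """Return a function r(i, j) for 1-based, inclusive indices i <= j."""
    s1 = [0.0] + list(accumulate(y))                 # s1[m] = y_1 + ... + y_m
    s2 = [0.0] + list(accumulate(v * v for v in y))  # s2[m] = y_1^2 + ... + y_m^2

    def r(i, j):
        total = s1[j] - s1[i - 1]
        # Sum of squared deviations about the segment mean (Equation 2).
        return (s2[j] - s2[i - 1]) - total * total / (j - i + 1)

    return r

y = [1.0, 2.0, 3.0, 10.0]   # hypothetical data values
r = make_segment_score(y)
print(r(1, 3), r(4, 4), r(1, 4))
```

With this tabulation, the first row of scores r(1, 1), . . . , r(1, n) is obtained in O(n) total time. For very large values a numerically more careful formulation may be preferable; that is outside the scope of this sketch.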

[0294] Summing all of the r(i, j) values for a covering gives a measure of the overall homogeneity of the data points within each segment (of the covering). Denoting the data points within the k segments of a covering as the values from 1 to n₁, n₁+1 to n₂, . . . , n_(k−1)+1 to n, an overall measure W of the homogeneity of the segments (of the covering) is given by W=r(1, n₁)+r(n₁+1, n₂)+ . . . +r(n_(k−2)+1, n_(k−1))+r(n_(k−1)+1, n). Small values of W then correspond to higher degrees of homogeneity within the segments (of a covering). With such a strategy, an appropriate choice of segments (for a covering) is to choose values of n₁, n₂, . . . , n_(k−1) for which W is minimized. The overall measure W is the score function for a split or covering. Hawkins then proceeds to show how to find such an optimal set of k segments by using a dynamic programming computer algorithm.¹²
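The overall measure W can be illustrated by scoring two candidate coverings of the same data. This is a minimal sketch under stated assumptions: the names `segment_score` and `covering_score`, the list `cuts` (holding n₁, . . . , n_(k−1)) and the sample data are hypothetical, and the segment score is computed directly from the definition rather than by any particular fast method.

```python
def segment_score(y, i, j):
    """r(i, j): sum of squared deviations of y_i..y_j about their mean (1-based)."""
    seg = y[i - 1:j]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)

def covering_score(y, cuts):
    """W for the covering whose segments end at cuts[0], cuts[1], ..., and n."""
    bounds = [0] + list(cuts) + [len(y)]
    # Consecutive bounds (a, b) describe the segment a+1 .. b.
    return sum(segment_score(y, a + 1, b) for a, b in zip(bounds, bounds[1:]))

y = [1.0, 1.0, 5.0, 5.0, 9.0]          # hypothetical data values
w_good = covering_score(y, [2, 4])     # segments {1,1}, {5,5}, {9}
w_bad = covering_score(y, [1, 3])      # segments {1}, {1,5}, {5,9}
print(w_good, w_bad)
```

The covering that keeps equal values together has the smaller W, which is the sense in which minimizing W selects homogeneous segments.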

[0295] Hawkins's Dynamic Programming (DP) Algorithm for finding an optimal covering of n data points using k segments.

[0296] Hawkins's algorithm is based on the following principle. Given n data points, and an optimal covering using k segments (or a best k-way split), the last endpoint (or cutpoint) of the covering is n_(k−1)+1. Since this k-segment covering is an optimal covering for the data points from 1 to n, it follows that this covering is composed of an optimal k−1 segment covering for the data points from 1 to n_(k−1), plus the last segment covering points n_(k−1)+1 to n.¹³ Thus if the optimal k−1 segment covering for data points 1 to m is known for each point m, 1≦m≦n, then it is easy to find the optimal k segment covering for the data points from 1 to n. This is done using a simple search.

[0297] Simple search for finding the optimal k-way splits when the optimal (k−1)-way splits are known.

[0298] The simple search is done as follows. Let the point m be the last data point in a series of points 1 to m. Let F_(k−1)(m) be defined as the measure W (or score) for an optimal (k−1)-way split for the points from 1 to m. (W is then a minimum.) Similarly let F_(k)(n) be the measure W (or score) for an optimal k-way split for the data points 1 to n. It follows that F_(k)(n)=min {F_(k−1)(m)+r(m+1, n)} for k−1≦m≦n−1. The simple search is done by calculating the n−k+1 values of F_(k−1)(m)+r(m+1, n), one for each value of m from k−1 to n−1, and finding the minimum or minima.
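A single application of the simple search can be sketched as follows. The sketch assumes that the scores F_(k−1)(m) are already available in a list `prev_row` indexed by m, and that m is restricted to values for which the last segment m+1 to n is nonempty. The names `simple_search`, `prev_row` and `r` and the sample data are hypothetical.

```python
def r(y, i, j):
    """Segment score r(i, j) computed directly from its definition (1-based)."""
    seg = y[i - 1:j]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)

def simple_search(y, prev_row, k, n):
    """Return (F_k(n), m*) where m* is the last point of the first k-1 segments."""
    best_score, best_m = float("inf"), None
    for m in range(k - 1, n):                  # m = k-1, ..., n-1
        cand = prev_row[m] + r(y, m + 1, n)    # candidate score for this cutpoint
        if cand < best_score:
            best_score, best_m = cand, m
    return best_score, best_m

y = [1.0, 1.0, 5.0, 5.0]                                     # hypothetical data
prev_row = [None] + [r(y, 1, m) for m in range(1, len(y) + 1)]  # F_1(m) = r(1, m)
best_score, best_m = simple_search(y, prev_row, k=2, n=len(y))
print(best_score, best_m)
```

Here the best 2-way split places the cut after the second point, separating the two groups of equal values.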

[0299] Using the simple search recursively leads to an algorithm for finding the optimal k-way split for data points 1 to n.

[0300] As we have seen above, the optimal k-way splits (or coverings) can be deduced from the optimal (k−1)-way splits using a simple search. Since 1-way splits are unique, the optimal 2-way splits are deduced from them. And the optimal 3-way splits are deduced from the optimal 2-way splits. Applying this process recursively, the optimal k-way splits are finally deduced from the optimal (k−1)-way splits. This then is essentially Hawkins's algorithm.

[0301] Formal Presentation of Hawkins's DP Algorithm

[0302] Using the ideas presented above, Hawkins formally presents his algorithm.¹⁴ Algorithm: Let F_(j)(m) be the measure W (within segment sum of squared deviations) for an optimal j-way split for the data points 1 to m. Then F₁(m)=r(1, m) for m=1, 2, . . . , n. And, F_(j)(m)=min {F_(j−1)(v)+r(v+1, m)}, j−1≦v≦m−1.

[0303] Computational tables of F_(j)(m) are generated for m=1 to n and j=1, 2, 3, . . . , k. The value of W for an optimal k-way split on n data points is F_(k)(n) and F_(k)(n) is deduced as described above. The boundaries of the optimal segments are deduced from a “traceback” procedure. Similar algorithms are also presented.¹⁵
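The recursion and the traceback can be illustrated together in a short, nonlimiting sketch. It follows the recursion for F_(j)(m) stated above and records, for each table cell, the value of v at which the minimum occurred. The function name `hawkins_dp`, the table names `F` and `back` and the sample data are assumptions made for illustration, not a reproduction of any published code.

```python
from itertools import accumulate

def hawkins_dp(y, k):
    """Return (F_k(n), cuts) where cuts lists the last point of segments 1..k-1."""
    n = len(y)
    s1 = [0.0] + list(accumulate(y))
    s2 = [0.0] + list(accumulate(v * v for v in y))

    def r(i, j):  # sum of squared deviations of y_i..y_j about their mean
        t = s1[j] - s1[i - 1]
        return (s2[j] - s2[i - 1]) - t * t / (j - i + 1)

    inf = float("inf")
    F = [[inf] * (n + 1) for _ in range(k + 1)]   # F[j][m], 1-based
    back = [[0] * (n + 1) for _ in range(k + 1)]  # minimizing v for each cell
    for m in range(1, n + 1):
        F[1][m] = r(1, m)
    for j in range(2, k + 1):
        for m in range(j, n + 1):
            for v in range(j - 1, m):             # j-1 <= v <= m-1
                cand = F[j - 1][v] + r(v + 1, m)
                if cand < F[j][m]:
                    F[j][m], back[j][m] = cand, v

    # Traceback: walk from cell (k, n) up to row 2, collecting cutpoints.
    cuts, m = [], n
    for j in range(k, 1, -1):
        m = back[j][m]
        cuts.append(m)
    cuts.reverse()
    return F[k][n], cuts

score, cuts = hawkins_dp([1.0, 1.0, 5.0, 5.0, 9.0, 9.0], 3)  # hypothetical data
print(score, cuts)
```

The three nested loops make the O(kn²) cost of the tabulation visible: each of the k rows has up to n cells, and each cell examines up to n candidate values.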

[0304] Segmenting with Missing or “Float” Values

[0305] Musser's thesis describes how to handle missing values within Hawkins's DP. It is often the case with real-world data that descriptors will take on missing or “floating” values. In this case, it is still possible to segment using the missing values as predictors. The missing values can either be put in their own segment, or grouped with one of the other segments. The choice of which segment the missing cases should be put with is made so as to maximize segment homogeneity (that is, to minimize the measure r(i, j)). We can define a function F*_(k)(m) that gives the optimal measure for a k-way split that includes missing values, and r*(i, j) as the measure for a segment containing data values y_(i) through y_(j) with missing values placed within that segment. Then the recursion becomes $F_{j}^{*}(m) = \min \begin{cases} \min \left\{ F_{j-1}^{*}(v) + r(v+1, m) \right\}, & j-1 \leq v \leq m-1 \\ \min \left\{ F_{j-1}(v) + r^{*}(v+1, m) \right\}, & j-1 \leq v \leq m-1 \end{cases}$

[0306] The top part of the equation puts the missing values somewhere among the segments to the left of the final segment. The bottom part puts the missing values with the final segment. The case where the missing values are all alone would be where the segment scored by r*(v+1, m) contains no data values, so that only missing values are contained in that segment. Operationally, one must tabulate the F_(j) values and the F*_(j) values separately in order to handle missing values.

[0307] A more detailed examination of Hawkins's algorithm.

[0308] To better understand the Hawkins DP algorithm, the following table is presented that illustrates the workings of the algorithm. (There is no such actual pictorial table in Hawkins's published papers on this topic.) This table illustrates the tabulation of values of F_(j)(m) that are generated for m=1 to n and j=1, 2, . . . , k by the algorithm. In this illustration we essentially diagram the process of obtaining the tables of values of F_(j)(m) in a pictorial form.

[0309] By making the pictorial table, we diagram the process so it can be further understood. First we compute a vector that has in positions 1 . . . n the values for F₁(1), F₁(2) . . . F₁(n). Then we compute F₂(2), F₂(3) . . . F₂(n) in terms of F₁(1), F₁(2) . . . F₁(n). We continue this process up until k=4 segments, depicted in the following table. (Such a table is exemplary, nonlimiting and merely illustrative and can be drawn for any value of k.) The table, Table 1, is given below.

TABLE 1

$\begin{array}{l|lllllll}
 & F(1) & F(2) & F(3) & F(4) & F(5) & \cdots & F(n) \\
\hline
F_{1}(\;) & r(1,1) & r(1,2) & r(1,3) & r(1,4) & r(1,5) & \cdots & r(1,n) \\
F_{2}(\;) & 0 & F_{1}(1)+r(2,2) & \min\{F_{1}(1)+r(2,3),\ F_{1}(2)+r(3,3)\} & \min\{F_{1}(1)+r(2,4),\ F_{1}(2)+r(3,4),\ F_{1}(3)+r(4,4)\} & \min\{F_{1}(1)+r(2,5),\ F_{1}(2)+r(3,5),\ F_{1}(3)+r(4,5),\ F_{1}(4)+r(5,5)\} & \cdots & \min\{F_{1}(1)+r(2,n),\ F_{1}(2)+r(3,n),\ F_{1}(3)+r(4,n),\ F_{1}(4)+r(5,n),\ \ldots,\ F_{1}(n-1)+r(n,n)\} \\
F_{3}(\;) & 0 & 0 & \min\{F_{2}(2)+r(3,3)\} & \min\{F_{2}(2)+r(3,4),\ F_{2}(3)+r(4,4)\} & \min\{F_{2}(2)+r(3,5),\ F_{2}(3)+r(4,5),\ F_{2}(4)+r(5,5)\} & \cdots & \min\{F_{2}(2)+r(3,n),\ F_{2}(3)+r(4,n),\ F_{2}(4)+r(5,n),\ \ldots,\ F_{2}(n-1)+r(n,n)\} \\
F_{4}(\;) & 0 & 0 & 0 & \min\{F_{3}(3)+r(4,4)\} & \min\{F_{3}(3)+r(4,5),\ F_{3}(4)+r(5,5)\} & \cdots & \min\{F_{3}(3)+r(4,n),\ F_{3}(4)+r(5,n),\ \ldots,\ F_{3}(n-1)+r(n,n)\}
\end{array}$

[0310] The zeros in the table are where it is impossible to have a k-way split because there are fewer than k data points. The score for the optimal 4-way split is given by F₄(n), which is the bottom rightmost entry in the table. The actual positions where the splits occur can be traced if an additional table is kept of the position where the minimum value occurred for each cell in the table. The algorithm is O(kn²). For a given row past the first row, the rightmost column takes the minimum of n−1 items, the next to the left takes n−2, and so on down to zero. The running time for a given row is thus O(n²). Because there are k rows for a k-way split, and it costs O(n²) to compute the entries for a row, the total running time is thus O(kn²).

[0311] Fast Segmenting Algorithm Description

[0312] By drawing the computations for the Hawkins O(n²) algorithm in a tabular form, it is possible to make some novel observations about the computation, and derive new faster algorithms. Consider the cells that compute the values for F₃(4) and F₃(5). The first element (or candidate score) of the minimum for these two cells is given by F₂(2)+r(3,4) and F₂(2)+r(3,5) respectively. Suppose that F₂(2)+r(3,5) was the lowest score for its cell. It does not follow that F₂(2)+r(3,4) will be the lowest score for its cell, but because the score computation differs only by a single element (or data point, y₅), and the same element (or data point or observation) is removed from the score of each potential minimum in the cell, it is reasonable to expect that it will be among the lowest scores for its cell. This is a key concept. (The values F₂(2)+r(3,5) and F₂(2)+r(3,4) are equal level candidate values of adjacent cells of a row in the table. These two values differ only by the data point y₅. F₂(2)+r(3,5)=C₃(2,5) and F₂(2)+r(3,4)=C₃(2,4); see the definitions section for more on candidate values and equal level candidate values.)

[0313] If we take the smallest c scores for the rightmost cell in a row, and c is sufficiently large, it is highly likely that the minimum score in the next column to the left will be among those c scores, adjusted to remove the observation (or data point) dropped out of the cell to the left. Furthermore, if c is sufficiently large, we are likely to find the best score for subsequent columns among those c scores. However, because we drop an observation (or data point) each time, thus changing the score a bit each time, we will eventually have to recompute a new set of scores from scratch. These ideas lead to the following new algorithms.

FSA TEACHING EXAMPLE 1

[0314] 1. Compute F₁(1) . . . F₁(n) in O(n) time using a cumulative sum.

[0315] 2. Compute F₂(n), saving the best √n scores. Computing the smallest √n elements of an n element vector can be done in O(n) time with a selection algorithm (or similar algorithm, or one or more algorithms that achieve essentially the same result); see chapter ten of reference Cormen (1990).

[0316] 3. Compute F₂(n−1) by removing the observation from the √n best scores, and computing the minimum of those updated scores. This can be done in O(√n) time. Repeatedly do this updating procedure to compute F₂(n−2) . . . F₂(n−√n).

[0317] 4. At this point, as in step 2, we go through all of the approximately n−√n scores and save the smallest √n scores. Then, as in step 3, compute the next √n entries of the table using the updating procedure.

[0318] 5. Repeat steps 3 and 4 √n times until all entries in the row have been computed.

[0319] 6. We have now computed F₂(1) . . . F₂(n). We can repeat the same steps 2 through 5 to compute F₃(1) . . . F₃(n), and so on until we have computed k rows of the table to find the best k-way split.

[0320] The running time of this algorithm is O(n√n)=O(n^(1.5)). It costs us O(n) steps to compute a subset. We compute a subset √n times, giving a running time of O(n√n). We also do an updating procedure on √n items n times, giving a running time of O(n√n).
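Step 2 above relies on selecting the smallest c scores of a cell in linear time. One way to do that is a randomized quickselect, sketched below as an illustration rather than as the specific routine used by the invention; the function name `smallest_c` is hypothetical. It runs in expected O(n) time and returns the selected scores in no particular order.

```python
import random


def smallest_c(scores, c, rng=random):
    """Return the c smallest items of scores, in arbitrary order (expected O(n))."""
    items = list(scores)
    if c <= 0:
        return []
    if c >= len(items):
        return items
    lo, hi = 0, len(items) - 1
    target = c - 1  # index that must end up holding the c-th smallest item
    while lo < hi:
        pivot = items[rng.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition of items[lo..hi] around the pivot value.
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        if target <= j:
            hi = j          # the c-th smallest lies in the left part
        elif target >= i:
            lo = i          # the c-th smallest lies in the right part
        else:
            break           # target sits among items equal to the pivot
    return items[:c]
```

In an FSA, the items would typically be (score, position) pairs so that the position v of each saved candidate is kept alongside its score.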

[0321] Versions of the invention take smaller subsets, and recompute less frequently. This speeds up the algorithm, possibly at the expense of giving less optimal splits. Another embodiment of the invention that runs faster but has a higher chance of giving suboptimal splits is as follows.

FSA TEACHING EXAMPLE 2

[0322] 1. Compute F₁(1) . . . F₁(n) in O(n) time using a cumulative sum.

[0323] 2. Compute F₂(n), saving the best log n scores. This is done with a randomized selection algorithm (or similar algorithm, or one or more algorithms that achieve essentially the same result) in O(n) time.

[0324] 3. Compute F₂(n−1) by removing the observation from the log n best scores, and computing the minimum of those updated scores. This can be done in O(log n) time. Repeatedly do this updating procedure n/log n times to compute F₂(n−2) . . . F₂(n−(n/log n)).

[0325] 4. At this point, as in step 2, we go through all of the approximately n−(n/log n) scores and save the smallest log n scores. Then, as in step 3, compute the next n/log n entries of the table using the updating procedure.

[0326] 5. Repeat steps 3 and 4 log n times until all entries in the row have been computed.

[0327] 6. We have now computed F₂(1) . . . F₂(n). We can repeat the same steps 2 through 5 to compute F₃(1) . . . F₃(n), and so on until we have computed k rows of the table to find the best k-way split.

[0328] The running time of this algorithm is O(n log n). It costs us O(n) steps to compute a subset. We compute a subset log n times, since we recompute every n/log n steps. We also do an updating procedure on log n items n times.

FSA TEACHING EXAMPLE 3

[0329] 1. Compute F₁(1) . . . F₁(n) in O(n) time using a cumulative sum.

[0330] 2. Compute F₂(n), saving the best c₁ scores, where c₁ is a constant. This is done with a randomized selection algorithm (or similar algorithm, or one or more algorithms that achieve essentially the same result) in O(n) time.

[0331] 3. Compute F₂(n−1) by removing the observation from the c₁ best scores, and computing the minimum of those updated scores. This can be done in constant time. Repeatedly do this updating procedure n/c₁ times to compute F₂(n−2) . . . F₂(n−(n/c₁)).

[0332] 4. At this point, as in step 2, we go through all of the approximately n−(n/c₁) scores and save the smallest c₁ scores. Then, as in step 3, compute the next n/c₁ entries of the table using the updating procedure.

[0333] 5. Repeat steps 3 and 4 c₁ times until all entries in the row have been computed.

[0334] 6. We have now computed F₂(1) . . . F₂(n). We can repeat the same steps 2 through 5 to compute F₃(1) . . . F₃(n), and so on until we have computed k rows of the table to find the best k-way split.

[0335] The running time of this algorithm is O(n). It costs us O(n) steps to compute a subset. We compute a subset a constant c₁ times, since we recompute every n/c₁ steps. We also do an updating procedure on c₁ items n times.

[0336] Alternate embodiments of the invention take various subset sizes and recompute the subset at various intervals. Rather than having subset sizes of exactly √n, it is desirable in some cases to take some constant factor multiplied by √n. The same holds for the other quantities. As is well known in the analysis of algorithms, changing these constant factors will not change the overall asymptotic functional form of the running time of the algorithm. However, it could have large consequences for the actual time spent, and for the optimality of the solution.
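The three teaching examples differ only in the subset size and the recomputation interval, so they can be illustrated with one parameterized sketch. The code below is a hedged illustration under stated assumptions, not the definitive implementation: it assumes the sum of squared deviations as r(i, j), uses `heapq.nsmallest` (O(n log c)) as a simple stand-in for a linear-time selection algorithm, and falls back to a full recomputation whenever none of the saved candidates is still valid for a cell. The names `make_sse`, `fsa_row`, `fsa_score`, `c` and `interval` are hypothetical.

```python
import heapq
import math


def make_sse(y):
    """Return r(i, j): sum of squared deviations of y_i..y_j (1-based, inclusive)."""
    s, q = [0.0], [0.0]
    for value in y:
        s.append(s[-1] + value)
        q.append(q[-1] + value * value)

    def r(i, j):
        total = s[j] - s[i - 1]
        return (q[j] - q[i - 1]) - total * total / (j - i + 1)

    return r


def fsa_row(f_prev, r, j, n, c, interval):
    """Compute row j from row j-1, working leftward from the rightmost cell."""
    f_row = [math.inf] * (n + 1)
    m = n
    while m >= j:
        # Reference cell: evaluate every candidate C_j(v, m) and keep the c best.
        cands = [(f_prev[v] + r(v + 1, m), v) for v in range(j - 1, m)]
        f_row[m] = min(cands)[0]
        kept = [v for _, v in heapq.nsmallest(c, cands)]
        m -= 1
        steps = 0
        # Updating procedure: re-score only the kept candidates for the next cells.
        while m >= j and steps < interval:
            kept = [v for v in kept if v < m]  # a candidate needs v <= m-1
            if not kept:
                break  # assumption: recompute from scratch when nothing is left
            f_row[m] = min(f_prev[v] + r(v + 1, m) for v in kept)
            m -= 1
            steps += 1
    return f_row


def fsa_score(y, k, c=None, interval=None):
    """Approximate F_k(n); c and interval default to about sqrt(n) (Example 1)."""
    n = len(y)
    root = max(1, math.isqrt(n))
    c = root if c is None else c
    interval = root if interval is None else interval
    r = make_sse(y)
    f_row = [math.inf] + [r(1, m) for m in range(1, n + 1)]  # row 1
    for j in range(2, k + 1):
        f_row = fsa_row(f_row, r, j, n, c, interval)
    return f_row[n]
```

Setting `interval=0` makes every cell a reference cell and reproduces the exact dynamic program. A larger `interval` or a smaller `c` trades optimality for speed, in the manner of Teaching Examples 2 and 3.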

[0337] Reference: Cormen, T. H.; Leiserson, C. E. and Rivest, R. L. (1990) Introduction to Algorithms, Cambridge, Mass.: The MIT Press.

[0338] Versions of fast segmenting algorithms are described above. The use of a computer-based method that uses one or more of these algorithms (or similar algorithms) to segment data objects, including data objects that represent real world objects, is a version of the invention. Any invention, process or apparatus, or similar entity that includes one or more of these (or similar) algorithms is a version of the invention.

[0339] Versions of the fast segmenting algorithm calculate by adding or removing an observation from a cell using “running sums”, a well known technique in computer science. Versions of the invention (algorithm) described above compute F values in each of the cells of the table by following a certain “path” of computation. This path computes F₁( ) values first, then moves down to the rightmost cell in the second row and proceeds backwards. Other versions of the invention follow different computational paths to calculate F values. For example, a version calculates F₁( ) values first, then calculates F₂( ) values for an interior cell first (saving the best c scores), and then follows a path to the right and to the left along the second row computing c F₂( ) scores for second row cells. Similar variations of the computational path described above are followed by various versions of the invention to compute all or essentially all cells in the table.

[0340] Some such versions recompute all or essentially all the scores in a cell at various intervals as described above. Some versions of the invention do not recompute all or essentially all scores in a cell at periodic intervals or at any interval. Versions of the invention are operable and have utility for score functions, deviance measures, statistical measures of homogeneity or equivalent measures other than the sum of squares type score function described above. These include measures of homogeneity or equivalent measures similar to those discussed in references 1-9.

General Definitions

[0341] Some concepts behind versions of Fast Segmenting Algorithms have been described above. General definitions are given here to allow a more general description of versions of the invention (Fast Segmenting Algorithms).

[0342] General definition of a measure of segment homogeneity, r(i, j) (a measure of the homogeneity of the data points within a segment). Let y₁, y₂, y₃, . . . , y_(n−2), y_(n−1), y_(n) be a group of n discrete data values or data points in a sequential order, and let 1≤i≤j≤n. Then r(i, j) is a measure of homogeneity for the segment corresponding to points i, i+1, . . . , j−1, j. Specific examples of a measure of segment homogeneity include (1) the sum of squared deviations of the data points within the segment about their mean, (2) the sum of the absolute values of the deviation of each data point within the segment from the within segment data point mean, and (3) a measure of the variance of the data points within the segment. Other examples are given in equations 3 and 4 below, wherein z is a positive number. The values 1 and 2 are preferred values for z. $\begin{matrix}{r(i,j) = \sum\limits_{m=i}^{j}\left| y_{m} - \bar{y}_{i,j} \right|^{z}} & {{Equation}\quad 3} \\{r(i,j) = \sqrt[z]{\sum\limits_{m=i}^{j}\left| y_{m} - \bar{y}_{i,j} \right|^{z}}} & {{Equation}\quad 4} \\{\bar{y}_{i,j} \cong \frac{\sum\limits_{m=i}^{j} y_{m}}{j-i+1}} & {{Equation}\quad 5}\end{matrix}$
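As a concrete illustration of equations 3 through 5, the following sketch evaluates the segment measure directly from its definition. It assumes an exact mean in equation 5, 1-based inclusive indices, and reads equation 4 as the z-th root of the sum in equation 3; the function name `r_power` and the flag `take_root` are hypothetical.

```python
def r_power(y, i, j, z=2, take_root=False):
    """Segment homogeneity of y_i..y_j (1-based, inclusive).

    take_root=False evaluates equation 3; take_root=True evaluates equation 4,
    read here as the z-th root of the same sum.
    """
    segment = y[i - 1:j]
    mean = sum(segment) / len(segment)  # equation 5 with an exact mean
    total = sum(abs(value - mean) ** z for value in segment)
    return total ** (1.0 / z) if take_root else total
```

With z=2 and `take_root=False` this is the sum of squared deviations; with z=1 it is the sum of absolute deviations from the mean.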

[0343] In equation 5, the mean is an exact or approximate mean. Versions of the invention segment responses y_(i) that are from a binomial distribution, where there are only two possible values that the y_(i) responses can take. Let us denote these values as zero (0) or one (1). Then a measure of segment homogeneity that is preferred for versions of the invention that use binomial responses is given by equation 6.

$\begin{matrix}{r(i,j) = -2(j-i+1)\left( \bar{y}_{i,j}\log\left( \bar{y}_{i,j} \right) + \left( 1-\bar{y}_{i,j} \right)\log\left( 1-\bar{y}_{i,j} \right) \right)} & {{Equation}\quad 6}\end{matrix}$
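Equation 6 can be evaluated as in the sketch below. Two points are assumptions made for illustration and are not stated in the text: the natural logarithm is used, and the term p·log(p) is taken to be 0 when p is 0, so that a segment consisting of all zeros or all ones scores 0. The function name `r_binomial` is hypothetical.

```python
import math


def r_binomial(y, i, j):
    """Equation 6 for 0/1 responses y_i..y_j (1-based, inclusive)."""
    segment = y[i - 1:j]
    count = len(segment)
    p = sum(segment) / count  # segment mean, i.e. the fraction of ones

    def xlogx(x):
        # Assumed convention: 0 * log(0) is treated as 0.
        return 0.0 if x == 0 else x * math.log(x)

    return -2.0 * count * (xlogx(p) + xlogx(1.0 - p))
```

A pure segment gives 0, and a segment split evenly between zeros and ones gives the largest value for its length.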

[0344] Multivariate Versions of the Invention

[0345] Up until now, we have considered the cases where the y_(i) are univariate values. Versions of the invention use multivariate or vector valued responses, where we have a sequence of p-component multivariate vectors Y_(i), i=1, 2, . . . , n. Measures of homogeneity between vector-valued responses known to a person of ordinary skill in statistics define versions of the invention. One such measure of homogeneity is the Pillai-Bartlett-Nanda trace (or Pillai trace for short) statistic. Define the mean vector of the multivariate vectors as: $\begin{matrix}{\bar{Y}_{i,j} = \frac{\sum\limits_{m=i}^{j} Y_{m}}{j-i+1}} & {{Equation}\quad 7}\end{matrix}$

[0346] Define the total sum of squares and cross-products matrix as: $\begin{matrix}{S = \sum\limits_{m=1}^{n}\left( Y_{m} - \bar{Y}_{1,n} \right)\left( Y_{m} - \bar{Y}_{1,n} \right)^{T}} & {{Equation}\quad 8}\end{matrix}$

[0347] Then the following multivariate segment homogeneity measure defines a version of the invention that operates on multivariate responses: $\begin{matrix}{r(i,j) = \mathrm{trace}\left( S^{-1/2}\sum\limits_{m=i}^{j}\left( Y_{m} - \bar{Y}_{i,j} \right)\left( Y_{m} - \bar{Y}_{i,j} \right)^{T}\left( S^{-1/2} \right)^{T} \right)} & {{Equation}\quad 9}\end{matrix}$

[0348] The matrix inverse square root of S serves to standardize the data, and the trace of the matrix gives a single number as a value for r(i,j), allowing us to use the rest of the dynamic program unaltered. This measure of homogeneity is most appropriate when the data vectors are essentially normally distributed. When the data vectors are binary, a more appropriate statistic is a higher dimensional analog of equation 6. If the vector is p-dimensional, we simply sum up the one-dimensional r(i,j) measures for each dimension of the vector. Other measures known by a person of ordinary skill in statistics may be used to test for segment homogeneity, including the Hotelling T_(0) squared statistic.
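As a worked note on equation 9, the matrix square root need not be formed explicitly. This is a sketch that assumes S is invertible and that the inverse square root satisfies (S^{-1/2})^T S^{-1/2} = S^{-1}, as the symmetric inverse square root does. The symbol A_{i,j} is introduced here only for illustration and denotes the within-segment sum of squares and cross-products matrix. The cyclic property of the trace then gives:

```latex
A_{i,j} = \sum_{m=i}^{j} \left( Y_m - \bar{Y}_{i,j} \right)\left( Y_m - \bar{Y}_{i,j} \right)^{T},
\qquad
r(i,j) = \mathrm{trace}\!\left( S^{-1/2} A_{i,j} \left( S^{-1/2} \right)^{T} \right)
       = \mathrm{trace}\!\left( \left( S^{-1/2} \right)^{T} S^{-1/2} A_{i,j} \right)
       = \mathrm{trace}\!\left( S^{-1} A_{i,j} \right).
```

For p=1 this reduces to the within-segment sum of squared deviations divided by the total sum of squared deviations.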

[0349] Another example of a measure of segment homogeneity is any function that is a monotonic or essentially monotonic function (including a linear or essentially linear function) of any one of the above described measures of segment homogeneity. Also, any measure of segment homogeneity known to a person of ordinary skill in statistics or in the segmenting of data by statistical or computational methods is an example of a measure of segment homogeneity.

[0350] General definition of a measure of overall homogeneity, W*, of a covering of s segments (or an s-way split or segmentation) of d consecutive data points. By adding all the r(i,j) values for a covering, a measure of the overall homogeneity of the data points within each segment (of the covering) is obtained. Denoting the d data points within the s segments of a covering as the values from 1 to n₁, n₁+1 to n₂, . . . , n_(k−1)+1 to n, an overall measure W* of the homogeneity of the segments (of the covering) is given by

[0351] W*=r(1, n₁)+r(n₁+1, n₂)+. . . +r(n_(k−2)+1, n_(k−1))+r(n_(k−1)+1, n). A preferred measure of overall homogeneity W* is any measure derived from a preferred measure of segment homogeneity (such as sum of squared deviations or sum of absolute value deviations type measures). Another example of a general overall measure of homogeneity of a covering is any function that is a linear or essentially linear function of any one of the above described W*. Also, any measure of overall homogeneity of a covering known to a person of ordinary skill in statistics or in the segmenting of data by statistical or computational methods is an example of a measure of overall segment homogeneity. Smaller values of W* correspond to higher degrees of homogeneity within the segments (of a covering) for some measures W*. (It is also possible for larger values of W* to correspond to higher degrees of homogeneity within the segments (of a covering) for some measures W*.)

[0352] Let F_(k,W*)(n) be the value of W* for an optimal k-way split for the data points 1 to n for some measure W*. (W* is then a maximum or a minimum.)

[0353] A computational segmenting table is a nonlimiting pictorial characterization of the operation of a segmenting algorithm that finds a k-way split of n sequential data points (y₁, y₂, . . . , y_(n)). Splits found by such a segmenting algorithm include definite optimal, approximate definite optimal, probable optimal, approximate probable optimal and statistically meaningful k-way splits. (In some cases an algorithm finds other types of splits.) Table 1 is an example of a computational segmenting table.

[0354] A computational segmenting table is similar to a matrix in format, with one or more rows and one or more columns. Each computational segmenting table has a value for F_(j)(m), for a pair of values of j and m, wherein 1≤j≤k, 1≤m≤n. Each pair (j, m) corresponds to a cell in the table. For a given pair and cell, j corresponds to the row number and m corresponds to the column number of the cell. For each computational segmenting table, F_(j)(m) is an overall homogeneity score function value for a j-way split of m sequential data points (y₁, y₂, . . . , y_(m)); or F_(j)(m) does not correspond to a split and the value of F_(j)(m) is “undetermined”.

[0355] For each computational segmenting table, each F_(j)(m) corresponds to one and only one cell of the table. The value F_(j)(m) for any one cell is “elected” from the set or a subset of “candidate scores” for the cell. The value of F_(j)(m) for any one (or each) cell of the table is the elected score value for the (or each) cell. So F_(j)(m) is the elected score value for the cell (to which F_(j)(m) corresponds).

[0356] Election of an F_(j)(m) value to be the elected value for a cell of a computational segmenting table. For each computational segmenting table, each F_(j)(m) corresponds to one and only one cell of the table. Each segmenting algorithm determines a value of F_(j)(m) for each cell of a table that characterizes the algorithm. Each value of F_(j)(m) is determined (or elected) so that each value of F_(j)(m) is (1) a (definite) optimal score value, (2) an approximate (definite) optimal score value, (3) a probable optimal score value, (4) an approximate probable optimal score value, or (5) a statistically meaningful value (a value that corresponds to a statistically meaningful split). If F_(j)(m) is not reliably or reasonably described by one (or more) of the categories (1) through (5), then F_(j)(m) does not correspond to a split and F_(j)(m) is assigned the value (6) “undetermined”.

[0357] Each of the categories (1)-(6) in the above paragraph is an election category. For each value of F_(j)(m), the election category of any one F_(j)(m) is the lowest numbered category (1)-(6) which reliably or reasonably describes the F_(j)(m). Put another way, the election category of each F_(j)(m) is the lowest numbered category (1)-(6) which reliably or reasonably describes each F_(j)(m). For example, in the Hawkins DP algorithm, F_(j)(m) is determined using the relation F_(j)(m)=min {F_(j−1)(v)+r(v+1, m)}, wherein v takes on each possible value between j−1 and m−1 (j−1≤v≤m−1). And in the Hawkins DP algorithm, each F_(j)(m) is a definite optimal value for its cell. In FSA Teaching Example 1, only proper subsets of candidate scores are calculated for some cells and each F_(j)(m) of each such cell is a probable optimal value. In FSA Teaching Example 1, the set of all candidate scores is calculated for the reference cell corresponding to F₂(n) and the elected value F₂(n) is a definite optimal (minimal) value.

[0358] As described above, some cells of Table 1 are empty or have a “0” in them due to the fact that m<j. A cell of a computational segmenting table is always essentially empty, with an undetermined F_(j)(m) value, when m<j. A cell wherein m<j is an impossible cell.

[0359] A computational segmenting table that characterizes the operation of a segmenting algorithm is a table that includes details of the operation of the algorithm to obtain each piece of information used to find the endpoints (or changepoints, or cutpoints) of each segment of the k-way split made by the algorithm.

[0360] Candidate score values of a cell of a computational segmenting table. For each computational segmenting table, each value of F_(j−1)(v)+r(v+1, m) in a cell is a possible candidate score value to be the elected score value F_(j)(m). The candidate score value F_(j−1)(v)+r(v+1, m) is denoted C_(j)(v,m). C_(j)(v,m) is the score or score value (overall measure of homogeneity) for a j-way split on m data points (y₁, y₂, . . . , y_(m)), wherein the last segment of the split includes only the points v+1 to m. F_(j−1)(v)+r(v+1, m)=C_(j)(v,m). In FSA Teaching Example 1, the optimal candidate score for the cell that corresponds to F₂(n) (in a table similar to Table 1) is chosen to be the elected score value for the cell.

[0361] The set of (all) candidate score values of a cell of a table is the set of all possible values of C_(j)(v, m), where v takes on each value from j−1 to m−1 (j−1≤v≤m−1).

[0362] A subset of the set of all candidate score values of a cell of a table is a set of possible values of C_(j)(v, m), wherein v takes on one or more of the values from j−1 to m−1 (j−1≤v≤m−1). The term a subset of the set of all candidate score values of a cell is sometimes abbreviated as a subset of possible candidate scores for a cell, subset of candidate scores for a cell, subset of all possible scores or similar language. (Unless stated otherwise, in this patent application the term subset of a set means the set itself or a proper subset of the set. A proper subset of a set is a subset (of the set) wherein at least one member of the set is not a member of the (proper) subset.)

[0363] A subset of the c best values of the set of candidate score values for a cell is a proper subset that contains the c most optimal scores of the set of all possible candidate scores (for the cell). Such a subset is a best score subset of the cell, and the number c is the size of the best score subset.

[0364] A candidate score within a cell that is a member of a selected best score subset is a best score (or best candidate score) for the cell.

[0365] A subset of c approximate best values of the set of candidate score values for a cell is a proper subset that contains c candidate scores of the cell, wherein the c scores are approximately the c most optimal scores of the set of all possible candidate scores (for the cell). Such a subset is an approximate best score subset of the cell, and the number c is the size of the approximate best score subset.

[0366] A candidate score within a cell that is a member of a selected approximate best score subset is an approximate best score (or an approximate best candidate score) for the cell.

[0367] Equal level (or same level) candidate score values of adjacent cells in a row of a computational table. In Table 1, the candidate score value expressions F₂(2)+r(3,4) and F₂(2)+r(3,5) are at the same level (or equal levels) of two adjacent cells in the same row of the table. (F₂(2)+r(3,5)=C₃(2,5) and F₂(2)+r(3,4)=C₃(2,4).) Similarly, given the two candidate score values C_(j)(v,m) and C_(j)(v,m+1), the two candidate values are in adjacent cells of the same row. These two values are equal level values. In terms of calculation,

[0368] C_(j)(v,m+1)−C_(j)(v,m)=r(v+1, m+1)−r(v+1, m). So these two values differ from each other by only one data point (or observation) in the expression for r( ). That data point is y_(m+1). These two values are related by the fact that it is possible to calculate each value of the pair from the other value of the pair by using the dynamic programming technique of running sums (or a similar technique that performs the same function). This calculation is done by adding or removing the data point y_(m+1) from the calculation. Similarly, the two candidate values C_(j)(v,m−1) and C_(j)(v,m) are equal level values. Each candidate value of a cell of a table has either one or two equal level candidate values in one or two adjacent cells (respectively) of the same row. Equal level value pairs in adjacent cells are related in that it is possible to calculate each value of the pair from the other value of the pair by using running sums (or a similar technique) and adding or removing the same data point from the calculation. (See FSA Teaching Example 1 and F₂(2)+r(3,5)=C₃(2,5) and F₂(2)+r(3,4)=C₃(2,4) as an example of equal level candidate scores.)
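The running-sum relationship between equal level candidate values can be sketched as follows. This is an illustration only, assuming the sum of squared deviations as r( ); the class `RunningSegment` and the function `shift_candidate` are hypothetical names. The segment keeps a count, a sum and a sum of squares, so that adding or removing one data point, and hence moving from C_(j)(v,m) to C_(j)(v,m+1) or back, costs constant time.

```python
class RunningSegment:
    """Count, sum and sum of squares of the points currently in a segment."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, value):
        self.count += 1
        self.total += value
        self.total_sq += value * value

    def remove(self, value):
        self.count -= 1
        self.total -= value
        self.total_sq -= value * value

    def sse(self):
        """Sum of squared deviations about the mean of the current points."""
        if self.count == 0:
            return 0.0
        return self.total_sq - self.total * self.total / self.count


def shift_candidate(f_prev_v, segment, value, add=True):
    """Return the equal level candidate in the adjacent cell.

    f_prev_v is F_(j-1)(v); segment currently holds y_(v+1)..y_m.
    add=True appends y_(m+1) (rightward); add=False drops y_m (leftward).
    """
    if add:
        segment.add(value)
    else:
        segment.remove(value)
    return f_prev_v + segment.sse()
```

Repeated subtraction can accumulate floating point error over long chains, which is one practical reason to recompute from scratch at intervals.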

[0369] Some possible routes of calculation for a candidate score value in a cell. It is possible to calculate each candidate score value C_(j)(v,m) in different ways, using different “routes”. Using the equation C_(j)(v,m)=F_(j−1)(v)+r(v+1, m), for example, it is possible to calculate C_(j)(v,m) from either C_(j)(v,m+1) or C_(j)(v,m−1) using running sums and removing or adding a data point. Such a calculation is a horizontal calculation, in that the candidate value has been calculated from other candidates in the same row (in this case also at the same level). A horizontal calculation has a direction: to the right when a data point is added to C_(j)(v,m−1) to obtain C_(j)(v,m), and to the left when a data point is removed from C_(j)(v,m+1) to obtain C_(j)(v,m). So there are horizontal rightward and leftward calculations. Some other routes are vertical. For example, using the equation C_(j)(v,m)=F_(j−1)(v)+r(v+1, m), it is possible to calculate C_(j)(v,m) when F_(j−1)(v) is known by calculating r(v+1, m). The direction of the calculation is downward in that C_(j)(v,m) is calculated using F_(j−1)(v) results from a row above. For example, in some preferred embodiments of a Fast Segmenting Algorithm, all values of C₂(v,n) are calculated from known values of F₁(v) using vertical calculations (see FSA Teaching Examples 1 and 2). In this patent application, the term FSA is sometimes used in place of fast segmenting algorithm.

[0370] Equal level (or same level) candidate score values of separated cells in a row of a computational table. As noted above, F₂(2)+r(3,5)=C₃(2,5) and F₂(2)+r(3,4)=C₃(2,4) in Table 1 are equal level candidate scores of adjacent cells (of the same row). Similarly, the candidate scores F₂(2)+r(3,5)=C₃(2,5) and F₂(2)+r(3,3)=C₃(2,3) in Table 1 are same level candidate scores of separated cells (of the same row). In terms of calculation, C₃(2,5)−C₃(2,3)=r(3,5)−r(3,3), so these two values differ from each other by only two data points (y₄ and y₅) in the expression for r( ).

[0371] Similarly, C_(j)(v,m+2) and C_(j)(v,m) are same level candidate scores of separated cells. And, in terms of calculation, C_(j)(v,m+2)−C_(j)(v,m)=r(v+1, m+2)−r(v+1, m). So these two values differ from each other by only two data points (or observations) in the expression for r( ). The two data points are y_(m+1) and y_(m+2). These two values are related by the fact that it is possible to calculate each value of the pair from the other value of the pair by using the dynamic programming technique of running sums (or a similar technique that performs the same function). This calculation is done by adding or removing the data points y_(m+1) and y_(m+2) from the calculation.

[0372] Generalizing, C_(j)(v, m+g) and C_(j)(v, m) are same level candidate scores of separated cells (of the same row), g≥2. These two values are related by the fact that it is possible to calculate each value of the pair from the other value of the pair by using the dynamic programming technique of running sums (or a similar technique that performs the same function). This calculation is done by adding or removing the data points y_(m+1), y_(m+2), . . . , y_(m+g) from the calculation. The candidate scores C_(j)(v, m−g) and C_(j)(v, m) are same level candidate scores of separated cells and have equivalent characteristics in terms of calculation.

[0373] Horizontal skip calculations. As noted above, C_(j)(v, m+g) and C_(j)(v, m), g≥2, are same level candidate score values of separated cells. And it is possible to calculate each value of the pair from the other value of the pair by adding or removing the data points y_(m+1), y_(m+2), . . . , y_(m+g) from the dynamic programming (DP) calculation. Such a dynamic programming calculation does not require calculation or storage of the values C_(j)(v, m+1), C_(j)(v, m+2), . . . , C_(j)(v, m+g−1) when calculating C_(j)(v, m+g) or C_(j)(v, m). Such a dynamic programming calculation essentially skips the values C_(j)(v, m+1), C_(j)(v, m+2), . . . , C_(j)(v, m+g−1). This calculation is horizontal in orientation, but essentially skips equal level candidate score values in the calculation. Such a DP calculation is a horizontal skip calculation.

[0374] The number g−1 is the skip number of the horizontal skip calculation. For a true horizontal skip calculation, the skip number is greater than or equal to 1. The skip number is zero when a horizontal calculation calculates a candidate value using a same level candidate value in an adjacent cell. When the skip number of a horizontal calculation is zero, the horizontal calculation is a nonskip horizontal calculation. Like nonskip horizontal calculations, each horizontal skip calculation has a rightward or leftward direction. Like nonskip horizontal calculations, it is possible to use horizontal skip calculations recursively. One or more versions of FSAs use one or more horizontal skip calculations and zero or more horizontal nonskip calculations recursively.

[0375] Calculating candidate score values at an equal level of a row recursively. As described above, versions of an FSA calculate one or more candidate score values at the same level of a row by using a horizontal calculation (in one direction) recursively. Similarly, versions of an FSA calculate one or more candidate score values at the same level of a row using one or more horizontal calculations (in one or both directions) recursively.

[0376] Some versions of FSAs calculate one or more same level (same row) candidate scores using a horizontal nonskip calculation (in one direction) recursively. (See, for example, FSA Teaching Examples 1 and 2.)

[0377] The length of a recursive horizontal nonskip calculation that calculates one or more same level candidate scores in an unbroken chain of adjacent row cells is the number of row cells in the unbroken chain.

[0378] A same level horizontal score string is a group of one or more same level candidate scores of an unbroken chain of adjacent same row cells, wherein each same level candidate score is calculated by an identical recursive horizontal nonskip calculation. (The identical calculation is, of course, unidirectional.) The length of a same level horizontal score string is the number of scores in the string.

[0379] Reference cells of a computational segmenting table. For the fast segmenting algorithm in FSA Teaching Example 1, all of the possible candidate values in the rightmost cell of the second row of the table (corresponding to F₂(n)) are calculated and are used to determine (or elect) F₂(n). A selection algorithm (or similar algorithm) is also used to select a subset of the best √n candidate scores in the cell. In like manner, all of the possible candidate values are calculated and the best √n scores selected in some cells of the table as described in step 4 of FSA Teaching Example 1. (The calculation in step 4 of FSA Teaching Example 1 for a cell is essentially a “recomputation” of all candidate scores.) Each cell for which all of the candidate scores are calculated and a best score subset selected in FSA Teaching Example 1 is a reference cell. (A reference cell may essentially be conceptualized (for versions of FSAs) as a reference-point from which further calculations in neighboring cells begin.)

[0380] Similarly, there are zero or more reference cells in a table characterizing versions of FSAs. A reference cell of an FSA is any cell with the following two characteristics. (1) A large number of the possible candidate score values, C_(j)(v,m), is computed in (or for) the cell. (2) A proper subset of the best or approximately best scores of the large number computed for the cell is selected. The term large number includes (but is not limited to) (a) all, (b) essentially all, (c) a high percentage, (d) most, (e) a random sample of the set of all candidate scores in the cell or (f) a statistically suitable number of the set of all possible candidate values for a cell. (The term “statistically suitable number” here means a number great enough that there is a reasonable or high probability that an F_(j)(m) value for the cell is determinable using the number. A value of F_(j)(m) that is determinable is one that is reliably or reasonably described by one or more of the election categories (1)-(5). In some cases, the magnitude of a statistically suitable number depends on the data point values y₁, y₂, . . . , y_(n).)^(XVI)

[0381] Selecting a best score subset or an approximate best score subset for a reference cell. As described in FSA Teaching Example 1, a best score subset is chosen from the set of all possible candidate scores for each reference cell, using a selection algorithm or an equivalent thereof. (The size of each best score subset in FSA Teaching Example 1 is √n.) Similarly, versions of FSAs select a best score subset or an approximate best score subset for each reference cell in a table that characterizes the operation of these versions of FSAs. The size of a best (or approximate best) score subset is the number of scores in the subset.

[0382] Using a best score subset or an approximate best score subset ofa reference cell to calculate candidate scores for nearby cells usingone or more horizontal calculations. Each of one or more versions ofFSAs selects a best or an approximate best score subset in eachreference cell of a table that characterizes each of the one or moreversions of FSAs. And each of the one or more versions of FSAs uses oneor more of the scores in the selected subset to form a horizontal startsubset. Each score in the start subset is then used (by these FSAs) tocalculate one or more candidate scores in nearby cells of the same rowwith one or more horizontal calculations.

[0383] Such a version of an FSA is described in each of FSA TeachingExamples 1 and 2. In these examples, the set of all possible candidatescores is calculated for each reference cell. A best score subset ofsize c is selected for each reference cell. And these c scores are usedto in one or more recursive (leftward) horizontal calculations tocalculate c of the candidate scores in each of one or more same rowcells of a table (similar to Table 1). (So that each horizontal startsubset has c scores as members of each start subset.) The length of eachof these recursive horizontal nonskip calculations is essentially ccells. The number c is {square root}n and log n respectively forExamples 1 and 2.

[0384] The size of a horizontal start subset is the number of scores in the subset. In FSA Teaching Examples 1 and 2, the size of each horizontal start subset is equal to c, the size of the best score subset selected for each reference cell.
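The use of a horizontal start subset can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the Teaching Examples: it handles only row j=2, it assumes that r( ) is the sum of squared deviations from the segment mean and that F_1(v)=r(1, v), and it assumes that lower scores are better. All function names and the data are hypothetical. Each leftward step removes one data point from the running value of r( ) instead of recomputing it.

```python
import heapq

def sum_sq_dev(points):
    """Sum of squared deviations from the mean (assumed form of r)."""
    if not points:
        return 0.0
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def reference_cell_candidates(y, m):
    """Vertical calculation of every C_2(v, m) = F_1(v) + r(v+1, m), v = 1..m-1."""
    return {v: sum_sq_dev(y[:v]) + sum_sq_dev(y[v:m]) for v in range(1, m)}

def leftward_strings(y, m_ref, start_subset, length):
    """Leftward nonskip horizontal calculations starting at column m_ref.

    For each v of the start subset, C_2(v, m) is obtained for
    m = m_ref-1, ..., m_ref-length by removing one point per step.
    Returns {m: {v: C_2(v, m)}}.
    """
    table = {}
    for v in start_subset:
        f_prev = sum_sq_dev(y[:v])          # F_1(v), assumed equal to r(1, v)
        count = m_ref - v                   # points in y_{v+1}..y_{m_ref}
        mean = sum(y[v:m_ref]) / count
        r = sum_sq_dev(y[v:m_ref])
        for m in range(m_ref - 1, m_ref - length - 1, -1):
            if m <= v:                      # size limited cell: the string stops
                break
            removed = y[m]                  # y_{m+1} in 1-based notation
            d = removed - mean
            r -= d * d * count / (count - 1)
            mean = (mean * count - removed) / (count - 1)
            count -= 1
            table.setdefault(m, {})[v] = f_prev + r
    return table

y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0]   # hypothetical data, n = 9
n, c = len(y), 3                                     # c is the square root of n
candidates = reference_cell_candidates(y, n)
best = heapq.nsmallest(c, candidates.items(), key=lambda kv: kv[1])
start_subset = [v for v, _ in best]
fast_cells = leftward_strings(y, n, start_subset, c)
elected = {m: min(scores.values()) for m, scores in fast_cells.items()}
```

In this sketch each of the three cells to the left of the reference cell receives only c candidate scores, one per member of the start subset, rather than the full set of m−1 candidates.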

[0385] Candidate score values that originated with a (score of a) horizontal start subset (of a reference cell). Versions of FSAs use one or more recursive horizontal (skip or nonskip, or skip and nonskip) calculations to calculate one or more candidate scores in one or more same row cells of a table. Each such calculation starts with a score of a horizontal start subset. A candidate score that is calculated using a horizontal calculation that started with, or crossed, a score of a horizontal start subset is a score that originated with a horizontal start subset. Alternatively, such a candidate score originated with a score of a horizontal start subset.

[0386] Size limited cells. As seen from examining Table 1, the size of the set of all possible candidate scores in a cell decreases the further the cell is to the left of the table. For example, the cell corresponding to F₂(3) has only two candidate scores. For a cell corresponding to F_(j)(m) the number of candidate scores in Table 1 is m−j+1. Similarly, the maximum number of possible candidate scores for any cell of any computational segmenting table is m−j+1. Defining the size of a cell of a table as the total number of candidate scores computed in (or for) the cell, it is clear that each cell has a maximal possible size. That maximal possible size is m−j+1. So each cell of a computational segmenting table is limited to each cell's maximal possible size, m−j+1.

[0387] FSA Teaching Examples 1 and 2 each describe preferred versions of FSAs. In each of these FSA versions, a subset of c best candidate scores is selected in each reference cell, and c scores are calculated in each of one or more same row cells using one or more recursive horizontal nonskip calculations. However, it is impossible for each of one or more (size limited) cells to have c candidate scores calculated for each of the one or more cells. This is because m−j+1<c for those cells.
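The maximal cell size and the size limited condition can be expressed in a few lines of Python. The sketch assumes that the split index v of a candidate score C_j(v, m) ranges over j−1, . . . , m−1, which is the range that yields the count m−j+1 stated above. The function names are hypothetical.

```python
def candidate_split_points(j, m):
    """Split indices v for which C_j(v, m) can exist (assumed range j-1..m-1)."""
    return list(range(j - 1, m))

def max_cell_size(j, m):
    """Maximal possible number of candidate scores in the cell for F_j(m)."""
    return max(0, m - j + 1)

def is_size_limited(j, m, c):
    """True when the cell cannot hold c candidate scores, that is m-j+1 < c."""
    return max_cell_size(j, m) < c

# The cell for F_2(3) has two possible candidate scores, so with c = 3
# it cannot receive c horizontally calculated scores.
size = max_cell_size(2, 3)
limited = is_size_limited(2, 3, 3)
```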

[0388] Completeness of a Computational Segmenting Table. It is not necessary that a computational segmenting table be complete in order to characterize the operation of a segmenting algorithm. For example, for some versions of an FSA, all of the candidate values for F_(k)(n) are calculated using a vertical calculation. For some such versions of an FSA, the cells corresponding to j=k (last row) and m<n are empty and F_(j)(m) is undetermined. This is because there is no need to calculate candidate scores (or determine F_(j)(m)) in those cells in order to determine F_(k)(n).

[0389] Void cells. For some versions of FSAs, computational segmenting tables that characterize the operation of the FSAs have an empty cell (or an undetermined value for F_(j)(m)) even when j<k and m≧j. Such a cell is a void cell. For a void cell, F_(j)(m) is undetermined for a particular value of j and m. This means that whatever the final k-way split of the n data points (done by such an FSA), the final split does not include a j-way split of the m data points y₁, y₂, . . . , y_(m) for the particular values of j and m. Similarly, when one or more void cells are present in a table, the final k-way split of the n data points (done by an FSA characterized by the table) does not include a j-way split of the m data points y₁, y₂, . . . , y_(m) for the particular values of j and m that correspond to the void cells. (When a void cell is present in a table, then C_(j)(v,m) is not calculated for any value of v for the particular values of j and m corresponding to the void cell.)

[0390] Fast cells of a computational segmenting table. A cell of a computational segmenting table wherein a proper subset of all possible candidate scores is computed using one or more horizontal calculations is a fast cell of the table. In addition, the total group of candidate scores computed (by any means) for a fast cell is also a proper subset of the set of all possible candidate scores. Thus a cell wherein all possible candidate scores are computed is not a fast cell. An example of a fast cell is the cell corresponding to F₂(n−1) in FSA Teaching Example 1.

[0391] For some versions of Fast Segmenting Algorithms, a cell is a fast cell and a reference cell. A fast cell that is also a reference cell is a fast reference cell. A fast cell that is not a reference cell is a simple fast cell. A fast cell wherein one or more of the candidate scores calculated for the cell originated with a score of a horizontal start subset (of a reference cell) is a select fast cell. A select fast cell that is not a reference cell is a simple select fast cell, abbreviated ss fast cell.

[0392] A string of same origin candidate scores is a sequence of equal level candidate scores in an unbroken chain of adjacent row cells, wherein each candidate score in the sequence originated with the same score of a horizontal start subset. The length of a string of same origin candidate scores is the number of scores in the string. The direction of a string is the direction of the horizontal recursive calculation that generated the string. The first score of a string is the first score in the sequence of candidate scores. The cell that contains the first score in the string is the first cell of the string. The last score of a string is the last score in the sequence. The cell that contains the last score in the string is the last cell of the string.

[0393] A chain of same origin simple select fast cells. An unbroken chain of adjacent same row ss fast cells, wherein each cell has a candidate score that originated with the same score of a horizontal start subset, is a chain of same origin simple select fast cells. A chain of same origin simple select fast cells is abbreviated as a chain of soss fast cells. (One or more chains of soss fast cells are described in FSA Teaching Example 1. The length of each of these chains is essentially √n cells.) The length of an soss fast cell chain is the number of cells in the chain. A chain of soss fast cells is essentially an unbroken sequence of adjacent row cells, wherein the sequence of cells contains one or more strings of same origin candidate scores.

[0394] The size of a cell. The size of a cell is the total number of candidate score values computed in the cell. The size of a cell is maximal if the set of all possible candidate scores is computed for the cell. The size of a reference cell is the total number of candidate score values computed in the cell. For example, in the FSA described under FSA Teaching Example 1, the sizes of the cells corresponding to F₂(n) and F₃(n) are n−1 and n−2 respectively. For versions of fast segmenting algorithms, each particular reference cell has a particular size. So for versions of FSAs, it is possible for two different reference cells to have two different sizes.

[0395] The size of a fast cell is the total number of candidate score values computed in the cell. For some versions of fast segmenting algorithms, each particular fast cell has a particular size. So for some versions of FSAs it is possible for two different fast cells to have two different sizes. For example, in the FSA described under FSA Teaching Example 1, essentially all of the fast cells have the same size; that size is √n.

[0396] The horizontal size and vertical size of a cell. Each cell has a size. The size of each such cell is further divided into a horizontal size and a vertical size. The horizontal size of a cell is the total number of candidate score values computed in the cell using a horizontal calculation. For example, in the FSA described under FSA Teaching Example 1, essentially all of the fast cells have the same horizontal size; that horizontal size is √n. (The vertical size of each of the fast cells is zero, because all of the candidate scores of each fast cell are computed using a horizontal calculation.)

[0397] The vertical size of a cell is the total number of candidate score values computed in the cell using a vertical calculation. For example, in the FSA described under FSA Teaching Example 1, the vertical size of the cell corresponding to F₂(n) is n−1. (The horizontal size of the cell corresponding to F₂(n) is zero, because all of the candidate scores in the cell are computed using a vertical calculation.)

[0398] A directional rectangle of same origin candidate scores is a group of one or more strings of same origin candidate scores of equal length, wherein each string in the group has the same first cell and each string in the group has the same last cell. A candidate score in a string of the group is a score in, within or contained in the rectangle. The length of the rectangle is the number of scores in each score string (in the rectangle). The width of the rectangle is the number of strings in the group. The first cell and last cell of the rectangle are respectively the first cell and last cell of each string in the group. A first or last cell of the rectangle is an end cell of the rectangle. If each score in each string of the group originated with a score of a horizontal start subset of one (same) reference cell, then the rectangle arises from the reference cell. If the last cell of the rectangle is adjacent to a reference cell, then the rectangle terminates on the reference cell. If the rectangle arises from a first reference cell and terminates on a second reference cell, then the rectangle is compatible with the pair of reference cells (wherein the pair consists of the first and second reference cells). The direction of the rectangle (leftward or rightward) is the direction of each string in the rectangle. (As is seen from FSA Teaching Examples 1 and 2, a rectangle of same origin candidate scores that is compatible with a pair of reference cells is a preferred rectangle.)

[0399] A cell block that fits a rectangle of same origin candidate scores is an unbroken chain of adjacent same row cells wherein each score in the rectangle is a score within a cell of the chain, and each cell of the chain contains one or more candidate scores in the rectangle. An end cell of the block is a first cell or last cell of the rectangle. (It is possible for a cell block to fit two or more rectangles, wherein one or more of the rectangles have a different direction.)

[0400] A pair of nearest same row reference cells is a pair of reference cells (in the same row of a table) that has no reference cell (in the same row of the table) between the pair.

[0401] An soss fast cell block is a chain of soss fast cells that fits one or more rectangles of same origin candidate scores. If one or more of the rectangles is compatible with a pair of (nearest same row) reference cells, then the soss fast cell block is congruent with the reference cell pair. The length of the fast cell block is the length of the soss fast cell chain. In FSA Teaching Example 1, essentially each soss fast cell block is congruent with a (nearest same row) reference cell pair. (In contrast to a pure block, below, it is possible for a cell of an soss fast cell block to contain one or more scores that are not in one rectangle. It is possible for the vertical size of one or more of the soss fast cells in the chain to be greater than zero. It is possible for two different fast cells in the chain to have different horizontal sizes.)

[0402] A pure soss fast cell block is a chain of soss fast cells, wherein each cell of the chain contains only one or more candidate scores in a largest rectangle of one or more rectangles (of same origin candidate scores, of the same direction and length), and each candidate score in the largest rectangle is contained in a cell of the chain of soss fast cells. A candidate score that is not contained in the largest rectangle is not contained in a cell of the block. If the largest rectangle is compatible with a (nearest same row) reference cell pair, then the pure soss fast cell block is congruent with the reference cell pair. Expressed another way, a pure soss fast cell block is essentially an soss fast cell chain that contains only one or more strings of same origin candidate scores, and each of the strings has the same direction and length. Each cell of the soss fast cell chain has the same horizontal size. The vertical size of each cell of the chain is zero. Each pair of same level candidate scores (of two cells in the soss fast cell chain) is part of the identical horizontal score string. The length of the fast cell block is the length of the soss fast cell chain. The width of the block is the horizontal size of each cell of the soss fast cell chain. The last and first cells of the block are the last and first cells respectively of a rectangle, wherein the block fits the rectangle. (In FSA Teaching Examples 1 and 2, the length and width of each pure soss fast cell block are essentially equal (and equal the number c). In these Examples, essentially each pure soss fast cell block is congruent with a (nearest same row) reference cell pair.)

[0403] The interval length between a pair of nearest same row reference cells is the number of cells in the same row that are between the two reference cells of the pair. For example, in the fast segmenting algorithm described under FSA Teaching Example 1, the interval length between a pair of nearest same row reference cells is essentially √n. In fact, in FSA Teaching Example 1, the interval length between essentially all pairs of nearest same row reference cells is √n, so that for this example of versions of fast segmenting algorithms, the reference cells occur periodically (or essentially periodically), with a period of essentially √n (cells).
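The interval length can be illustrated with a small Python sketch that places reference cells periodically in one table row. The placement rule is an assumption made for illustration only: reference cells are laid out leftward from the rightmost column n so that exactly c cells lie between nearest same row reference cells. The function names are hypothetical.

```python
def reference_columns(n, c, j):
    """Columns m of the reference cells in row j, laid out leftward from column n.

    Assumption: c non-reference cells lie between neighbouring reference cells,
    and no reference cell is placed left of column j.
    """
    return list(range(n, j - 1, -(c + 1)))

def interval_lengths(columns):
    """Number of cells strictly between each pair of nearest reference cells."""
    return [right - left - 1 for right, left in zip(columns, columns[1:])]

cols = reference_columns(16, 4, 2)      # n = 16, c = 4 (square root of n), row j = 2
gaps = interval_lengths(cols)
```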

[0404] More on the election (or determination) of an F_(j)(m) value. Each elected value F_(j)(m) is determined using one or more candidate score values. Each of one or more FSAs calculates each candidate value using a horizontal or a vertical calculation. It follows that each of one or more FSAs determines each of one or more F_(j)(m) values using a combination of one or more vertical calculations combined with one or more horizontal calculations.

[0405] Determination (or election) of F_(k)(n). F_(k)(n) is an overall score function value for a k-way split of n sequential data points (y₁, y₂, . . . , y_(n)). F_(k)(n) is a score that is (1) a (definite) optimal score value, (2) an approximate (definite) optimal score value, (3) a probable optimal score value, (4) an approximate probable optimal score value, or (5) a statistically meaningful value (a value that corresponds to a statistically meaningful split). If F_(k)(n) is not reliably or reasonably described by one (or more) of the categories (1) through (5), then F_(k)(n) does not correspond to a split and F_(k)(n) is assigned the value (6) “undetermined”.

[0406] A computational segmenting table that characterizes some versions of segmenting algorithms that determine F_(k)(n) has essentially only k−1 complete rows. In some such cases, for example, F_(k)(n) is elected from candidate values that have been calculated using a vertical calculation with one or more F_(k−1)(m) values previously generated.

[0407] A computational segmenting table that characterizes some versions of segmenting algorithms that determine F_(k)(n) has essentially k complete rows. In some such cases, for example, F_(k)(n) is elected from candidate values that have been calculated using one or more horizontal calculations with one or more F_(k)(m) values previously generated, wherein m≦n−1.

[0408] In addition, for some other versions of segmenting algorithms, F_(k)(n) is determined using a combination of vertical and horizontal calculations.

[0409] Storing or recording information on the candidate score value in a cell that is determined to be F_(j)(m). A candidate score value C_(j)(v̂,m) for a particular value of v, v̂, is determined by a segmenting algorithm to be F_(j)(m). In such a case, C_(j)(v̂,m)=F_(j)(m)=F_(j−1)(v̂)+r(v̂+1, m). By recording (or storing) the value v̂, or v̂+1, or an equivalent value, a traceback procedure is used to determine one or more endpoints of the segment(s) of the split associated with F_(k)(n). Inherent in any computational segmenting table is the storage or recording of values of v̂ for cells of the table. (It is possible to conceptualize the storage of values of v̂ (or an equivalent) in cells of a computational segmenting table or in a corresponding table.)
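The recording of the elected split index and the traceback procedure can be illustrated with a minimal Python sketch. The sketch is deliberately not a fast segmenting algorithm: every cell is computed in full with vertical calculations, so the table has no fast cells. It assumes that r( ) is the sum of squared deviations from the segment mean, that F_1(m)=r(1, m), and that the lowest candidate score is elected. The names segment_cost, full_dp_split and v_hat are hypothetical.

```python
def segment_cost(prefix, prefix_sq, a, b):
    """r(a, b): sum of squared deviations of y_a..y_b (1-based, inclusive)."""
    count = b - a + 1
    total = prefix[b] - prefix[a - 1]
    total_sq = prefix_sq[b] - prefix_sq[a - 1]
    return total_sq - total * total / count

def full_dp_split(y, k):
    """Return (F_k(n), endpoints) for a k-way split of y using a full table."""
    n = len(y)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, value in enumerate(y, start=1):
        prefix[i] = prefix[i - 1] + value
        prefix_sq[i] = prefix_sq[i - 1] + value * value

    inf = float("inf")
    F = [[inf] * (n + 1) for _ in range(k + 1)]      # elected scores F_j(m)
    v_hat = [[0] * (n + 1) for _ in range(k + 1)]    # recorded split index per cell
    for m in range(1, n + 1):
        F[1][m] = segment_cost(prefix, prefix_sq, 1, m)
    for j in range(2, k + 1):
        for m in range(j, n + 1):
            for v in range(j - 1, m):
                # C_j(v, m) = F_{j-1}(v) + r(v+1, m)
                cand = F[j - 1][v] + segment_cost(prefix, prefix_sq, v + 1, m)
                if cand < F[j][m]:
                    F[j][m] = cand
                    v_hat[j][m] = v

    # Traceback: follow the recorded split indices from the cell of F_k(n).
    endpoints = []
    m = n
    for j in range(k, 1, -1):
        m = v_hat[j][m]
        endpoints.append(m)
    endpoints.reverse()
    return F[k][n], endpoints

y = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0, 9.0]   # hypothetical data
score, endpoints = full_dp_split(y, 3)
```

Each returned endpoint is the index of the last data point of a segment, so the endpoints 3 and 5 correspond to the segments y₁..y₃, y₄..y₅ and y₆..y₈.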

[0410] Utility of versions of FSAs wherein a table that characterizes each of the FSAs includes one or more void cells or one or more horizontal skip calculations.

[0411] Some versions of FSAs essentially use one or more void cells or one or more horizontal skip calculations in segmenting. When a void cell is present in a table, then C_(j)(v,m) is not calculated for any value of v for the particular values of j and m corresponding to the void cell. Given a level of a cell of a table that corresponds to particular values of v, j and m, when a horizontal skip calculation skips the level (of the cell), then no candidate score C_(j)(v,m) is calculated for the particular values of v, j and m. When an FSA essentially uses one or more void cells or one or more horizontal skip calculations in segmenting, then particular component splits are eliminated from being part of the final k-way split chosen by the FSA. In some cases (such as for certain types of data points) the elimination of one or more component splits does not cause difficulties. For example, if the number of data points, n, is very large and the number of segments, k, is much smaller than the number of data points (such as in FIG. 5), and the transition between one or more “segments” is not abrupt, then an exact endpoint for each segment is frequently not critical. In some such situations, a k-way split by an FSA that uses one or more void cells or one or more horizontal skip calculations has increased speed without significantly sacrificing meaningfulness of the splits found by the FSA. This is one nonlimiting example.

[0412] Utility of versions of FSAs wherein each FSA employs a measure of segment homogeneity that uses an approximate mean.

[0413] As noted by examining equations 3, 4 and 5, some measures of segment data point homogeneity employ deviation from an exact or an approximate mean. A nonlimiting example wherein a measure of segment data point homogeneity that employs deviation from an approximate mean has utility is seen in the following situation. As noted above, versions of FSAs use a horizontal calculation to calculate C_(j)(v,m) from C_(j)(v,m+1). This is done using the equations C_(j)(v,m)=F_(j−1)(v)+r(v+1, m) and C_(j)(v,m+1)=F_(j−1)(v)+r(v+1, m+1). The two values C_(j)(v,m) and C_(j)(v,m+1) differ from each other by only one data point (or observation) in the expression for r( ). That data point is y_(m+1). By removing y_(m+1) from the expression for r(v+1, m+1), C_(j)(v,m) is calculated. However, the expressions r(v+1, m) and r(v+1, m+1) use different means. By using the mean of the data points y_(v+1), y_(v+2), . . . , y_(m+1) (the mean used in the expression for r(v+1, m+1)) as an approximate mean for the data points y_(v+1), y_(v+2), . . . , y_(m), an approximate value for r(v+1, m) is obtained. This approximate value is likely to be meaningful if the number of data points, n and m−v, is very large and the number of segments, k, is much smaller than the number of data points (such as in FIG. 5). Moreover, the use of the approximate mean in the expression for r(v+1, m) saves a calculation and increases speed.
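The difference between the exact and the approximate horizontal step can be made concrete with a short Python sketch. It assumes, for illustration only, that r( ) is the sum of squared deviations from the segment mean; equations 3, 4 and 5 may define other measures. The function names are hypothetical. With N points in the longer segment and d the deviation of the removed point from the old mean, the exact step subtracts d²·N/(N−1), while the approximate step keeps the old mean and subtracts only d².

```python
def sum_sq_dev(points):
    """Return (sum of squared deviations from the mean, mean)."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points), mean

def remove_last_exact(r_next, mean_next, count_next, y_removed):
    """Exact r(v+1, m) from r(v+1, m+1); count_next must be at least 2."""
    d = y_removed - mean_next
    return r_next - d * d * count_next / (count_next - 1)

def remove_last_approx(r_next, mean_next, y_removed):
    """Approximate r(v+1, m): deviations still taken from the old mean."""
    d = y_removed - mean_next
    return r_next - d * d

segment = [2.0, 4.0, 6.0, 8.0]                 # hypothetical y_{v+1}..y_{m+1}
r_next, mean_next = sum_sq_dev(segment)
exact = remove_last_exact(r_next, mean_next, len(segment), segment[-1])
approx = remove_last_approx(r_next, mean_next, segment[-1])
gap = approx - exact                           # equals d*d / (N - 1)
```

Under this assumed measure the approximate value is never smaller than the exact value, and the gap shrinks as the number of points in the segment grows, which is consistent with the statement that the approximation is most useful when m−v is large.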

[0414] Description of Versions of FSAs

[0415] The operation of each Fast Segmenting Algorithm is characterized by a computational segmenting table. Each table characterizing an FSA has one or more fast cells in the table. (In contrast, a table characterizing a Hawkins DP algorithm has no fast cells in the table.) Some versions of FSAs have one or more reference cells in a table characterizing each of these versions of FSAs. Some versions of FSAs have one or more candidate scores in one or more fast cells (of a table characterizing each of these versions of FSAs), wherein each of the one or more candidate scores originated with a score of a horizontal start subset.

[0416] A General Description of Versions of Fast Segmenting Algorithms

[0417] A general description of versions of FSAs is given below. (This general description is not necessarily limiting. There are other general descriptions of versions of FSAs that are supported by the subject matter contained herein.) This general description uses a computational segmenting table that characterizes the operation of the generally described versions of FSAs, wherein the table has one or more fast cells.

[0418] 1) A computer-based segmenting algorithm (or method) for finding one or more k-way splits of n data points (in sequential order), comprising:

[0419] calculating a subset of the candidate score values for each of one or more cells of a computational segmenting table, wherein the operation of the algorithm is characterized by the table, wherein one or more cells of the table is a fast cell (wherein only a proper subset of the set of all candidate scores is calculated in each fast cell, and each of one or more of the candidate scores calculated in each fast cell is calculated using a horizontal calculation);

[0420] electing a score value F_(j)(m) for each cell in the table using zero or more candidate score values in each cell, so that the score value F_(j)(m) for each cell of the table is reliably or reasonably described by one or more election categories;

[0421] determining F_(k)(n) using one or more values of F_(j)(m) from the table, so that F_(k)(n) is reliably or reasonably described by one or more election categories; and

[0422] determining a k-way split of the n data points using F_(k)(n), the table and a traceback procedure to find one or more endpoints of the split.

[0423] Some specific versions of Fast Segmenting Algorithms are also described in terms of attributes of a computational segmenting table that characterizes the operation of an FSA. A computational segmenting table that characterizes the operation of a segmenting algorithm is a table that includes details of the operation of the algorithm to obtain each piece of information used to find the endpoints (or changepoints, or cutpoints) of each segment of the k-way split made by the algorithm. Such details include FSA attributes. When a table that characterizes an FSA has one or more attributes, the FSA is said to have the attributes. These FSA attributes include (but are not necessarily limited to) the following attributes:

[0424] (1) the measure of segment homogeneity, r( ) in the table

[0425] (2) the measure of overall homogeneity, F( ) for each cell of the table

[0426] (3) the manner in which each candidate score value C_(j)(v,m) (used to determine each value of F_(j)(m)) was calculated, the calculation route that led to each such candidate score value, the type of calculation that led to each such candidate score value

[0427] (4) the election category for each F_(j)(m) value of each cell of the table

[0428] (5) the location of each reference cell of the table (and the total number of reference cells in the table)

[0429] i) whether a best or approximate best score subset is selected for each reference cell; the size of each horizontal start subset

[0430] ii) the size of each best or approximate best score subset selected

[0431] iii) the size of each horizontal start subset

[0432] iv) the interval length between each pair of nearest same row reference cells

[0433] (6) the location of each fast cell of the table (and the total number of fast cells in the table)

[0434] i) the horizontal size of each fast cell

[0435] ii) the location of each simple fast cell, and the number of simple fast cells

[0436] iii) the location of each simple select fast cell and the number of simple select fast cells

[0437] (7) the number and distribution of void cells in the table

[0438] (8) the size of each cell of the table

[0439] i) the vertical size of each cell of the table

[0440] ii) the horizontal size of each cell of the table

[0441] (9) the length, direction, starting cell and level of each horizontal recursive calculation

[0442] (10) the length, direction, starting cell, level and skip number of each horizontal skip calculation

[0443] (11) the number and length of each soss fast cell chain

[0444] (12) the number of soss fast cell blocks, the number of pure soss fast cell blocks

[0445] (13) the length and width of each block; the first and last cells of each block

[0446] (14) the number of directional rectangles of same origin candidate scores, the length, width, and first and last cells of each rectangle

[0447] (15) the number and location of pseudoreference cells

[0448] Possible FSAs. The following are descriptions of some possible versions of FSAs. An FSA need only have one fast cell. It is possible for an FSA to have any combination of attributes or characteristics described herein, as long as the FSA has one fast cell. It is possible for an FSA to have no reference cells or pseudoreference cells. It is possible for the spacing between pairs of nearest same row reference cells of an FSA to be different. It is possible for the sizes of best or approximately best score subsets of reference cells of an FSA to be different. The spacing of the reference cells of an FSA need not be periodic or essentially periodic. For some FSAs the spacing of one or more pairs of nearest same row reference cells is other than the size of respective horizontal start subsets or best score subsets of one or both of the cells of the pair. Candidate scores in one or more of the same cells of some FSAs are calculated using both right and left horizontal calculations. An FSA need not contain a fast cell block. An FSA with fast cell blocks need not contain a pure fast cell block. It is possible for the length and width of each pure fast cell block of an FSA to be substantially different. There are numerous other possible combinations that characterize other FSAs.

[0449] Some preferred versions of fast segmenting algorithms. Versions of FSAs as described in FSA Teaching Examples 1, 2 and 3 are preferred versions of FSAs. Versions of FSAs that are similar to the versions described in FSA Teaching Examples 1, 2, or 3 are preferred versions of FSAs. Generally, the closer (or more similar) a version of an FSA is to one or more of the versions of FSAs described in FSA Teaching Examples 1, 2, and 3, the more preferred the version of FSA. In addition, preferred versions of FSAs have one or more of the following preferred FSA attributes. Generally, the more of the preferred attributes an FSA has, the more preferred the FSA.

[0450] Preferred FSA attributes

[0451] (1) preferred measures of segment homogeneity, r( ), are described above.

[0452] (2) preferred measures of overall homogeneity, F( ), are described above.

[0453] (3) one or more preferred manners in which each candidate score value C_(j)(v,m) is calculated are as follows. In preferred versions of FSAs, each candidate score in a reference cell is calculated using a horizontal or a vertical calculation. In preferred versions of FSAs, each candidate score in a fast cell is calculated using a horizontal calculation.

[0454] (4) preferred election categories for each F_(j)(m) value of each cell of a table characterizing a preferred FSA are as follows. The most preferred election category is (1) a (definite) optimal score value, and the least preferred category is (6) “undetermined”. The lower the number of the election category, the more preferred the category. The more cells in a table for which F_(j)(m) is reliably or reasonably described by a more preferred category, the more preferred the FSA (or versions of FSAs) characterized by the table.

[0455] (5) preferred locations and numbers of reference cells of preferred versions of FSAs. A preferred location for reference cells in a table is each cell of the rightmost column in a table characterizing an FSA. More on preferred locations and numbers of reference cells is given below.

[0456] i) a best score subset is more preferable than an approximate best score subset

[0457] ii) for some preferred FSAs, the size of each selected best or approximate best score subset is the same or about the same size, c, for each preferred FSA (or table).

[0458] Preferred values for c are closest integer values of √n and log n. It is possible for c to be any integer wherein c<n. Other values of c are closest integer values given by the following equations:

c=n^(1/q)

[0459] or

c=log_(q) n,

[0460] wherein q≧1.
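The two families of values for c can be evaluated with a short Python sketch. The function names are hypothetical, and the lower bound of 1 on the result is an added assumption so that a subset is never empty. A logarithm to base q is defined only for q greater than 1, so the sketch rejects q=1 in that case.

```python
import math

def subset_size_root(n, q):
    """Closest integer to n^(1/q), for q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return max(1, round(n ** (1.0 / q)))

def subset_size_log(n, q):
    """Closest integer to log base q of n; needs q > 1 for the logarithm to exist."""
    if q <= 1:
        raise ValueError("log base q requires q > 1")
    return max(1, round(math.log(n) / math.log(q)))

n = 10000                                # hypothetical number of data points
c_root = subset_size_root(n, 2)          # square root of n
c_log = subset_size_log(n, 2)            # log base 2 of n
```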

[0461] iii) a preferred interval length between each pair of nearest same row reference cells is about the size (in terms of cells) of one or both of the reference cells of each pair. When all of the reference cells of a table characterizing an FSA are of equal or about equal size c, then a preferred interval length is c or about c. A preferred spacing of reference cells is essentially periodic in each table row. A preferred period is c or about c cells.

[0462] (6) one or more preferred locations of one or more simple fast cells of a table characterizing one or more preferred FSAs is wherein each simple fast cell is one of an unbroken chain of same row simple fast cells that occur between nearest same row reference cells.

[0463] i) A preferred horizontal size for a simple fast cell is the same or about the same size as a selected best or approximate best score subset of one or both reference cells of a pair of nearest same row reference cells, wherein the simple fast cell is between the pair of reference cells. A preferred horizontal size for a simple fast cell is the same or about the same size as a selected best or approximate best score subset of a nearest same row reference cell. A preferred horizontal size for each of one or more simple fast cells of a table that characterizes a preferred FSA is the size c (or about c), wherein c is the size of a selected best or approximate best score subset of each reference cell of the table. Preferred values for c are √n and log n.

[0464] (7) Generally a preferred value for the number of void cells in a table that characterizes an FSA is a small percentage of the total number of cells in the table. Generally a more preferred value for the number of void cells in a table that characterizes an FSA is less than about 10% of the total number of cells in the table. Generally a most preferred value for the number of void cells in a table that characterizes an FSA is less than about 1%. And generally an optimal number of void cells is zero or about zero. A preferred distribution of void cells is essentially periodic in terms of position in a row of a table.

[0465] (8) Some possible sizes for a reference cell are (a) all, (b) essentially all, (c) a high percentage, (d) most, or (e) a statistically suitable number of the set of all possible candidate values for a cell. A most preferred size for one or more reference cells of a table that characterizes one or more preferred FSAs is (a) all; a least preferred size is (e). Generally, the closer the size is to (a), the more preferred the size for one or more reference cells.

[0466] i) a preferred vertical size of each of one or more simple fast cells of a table that characterizes one or more preferred FSAs is zero or about zero

[0467] ii) information on preferred horizontal sizes for one or more simple fast cells is given above

[0468] (9) a preferred starting point for each of one or more horizontal recursive calculations is essentially a reference cell; a preferred length for each of one or more horizontal recursive calculations is essentially the number of same row cells between two nearest same row reference cells. Other starting points and lengths are possible.

[0469] (10) a preferred starting point for each of one or more horizontal skip calculations is essentially a reference cell; generally smaller skip numbers are preferred: a preferred skip number is less than 10% of n, a more preferred skip number is less than 1%, and a most preferred skip number is zero. Skip numbers greater than 10% are possible.

[0470] (11) One or more soss fast cell chains are more preferred than one or more isolated simple select fast cells.

[0471] (12) Soss fast cell blocks are preferred. Pure soss fast cell blocks are more preferred than soss fast cell blocks. One or more pure soss fast cell blocks of about equal length and width are preferred. One or more pure blocks wherein all of the pure blocks have approximately equal length and width are more preferred. Each pure block contains a largest rectangle. One or more pure blocks, wherein the length and width of the largest rectangle (contained in each block) is about equal to the size of a best or approximate best score subset of a reference cell from which the rectangle arises, are preferred. Pure blocks that are congruent with a reference cell pair are preferred.

[0472] (13) One or more directional rectangles of same origin candidate scores, wherein each rectangle is compatible with a nearest same row pair of reference cells, are preferred. One or more rectangles of approximately equal length and width are preferred. One or more rectangles wherein all of the rectangles have approximately equal length and width are more preferred. One or more rectangles wherein the length and width of each rectangle is equal or approximately equal to the size of a best or approximate best score subset of a reference cell from which each rectangle arises are preferred.

[0473] (14) Reference cells are preferred over pseudoreference cells. Although numerous preferred attributes have been listed, less preferred versions of FSAs have advantages in some situations. These situations are often dependent on the types of data points being segmented.

[0474] The present patent application claims priority from U.S. provisional patent application No. 60/225113 filed Aug. 14, 2000 and all of the contents of U.S. provisional application No. 60/225113 are incorporated herein by reference and to the fullest extent of the law. The present application is a CIP of PCT/US01/25519 (having the same title) filed Aug. 14, 2001 and PCT/US01/25519 is incorporated herein by reference in its entirety and to the fullest extent of the law. The present application claims priority from U.S. provisional patent application No. 60/358631 filed Feb. 20, 2002 and all of the contents of U.S. provisional application No. 60/358631 are incorporated herein by reference and to the fullest extent of the law.

[0475]¹ Hawkins D M, Merriam D F. Optimal Zonation of Digitized Sequential Data. Mathematical Geology, Vol. 5, No. 4, 1973, pp. 389-394.

[0476]² Musser B J. Extensions to Recursive Partitioning. Doctoral thesis (October 1999) under the supervision of Professor Douglas M. Hawkins, School of Statistics, University of Minnesota, St. Paul, Minn. 55108 USA.

[0477]³ Hawkins D M. On the Choice of Segments in Piecewise Approximation. J. Inst. Maths Applics (1972) 9, 250-256.

[0478]⁴ Hawkins D M. Point Estimation of the Parameters of Piecewise Regression Models. Appl. Statist. (1976), 25, No. 1, p. 51.

[0479]⁵ Hawkins D M, Merriam D F. Zonation of Multivariate Sequences of Digitized Geologic Data. Mathematical Geology, Vol. 6, No. 3, 1974.

[0480]⁶ Hawkins D M, ten Krooden J A. Zonation of Sequences of Heteroscedastic Multivariate Data. Computers & Geosciences, Vol. 5, pp. 189-194.

[0481]⁷ Hawkins D M. Computing Mean Vectors and Dispersion Matrices in Multivariate Analysis of Variance. Algorithm AS 72, Applied Statistics (Statistical Algorithms).

[0482]⁸ The FIRM Manual and Software are referred to herein as reference 8.

[0483]⁹ Chapter 5: Automatic Interaction Detection by Hawkins and Kass, pp. 269-302 in the book Topics in Applied Multivariate Analysis; Hawkins, D. H., Ed. Cambridge University Press is referred to herein as reference 9.

[0484]¹⁰ The book Cormen, T H; Leiserson, C E; and Rivest, R L (1990) Introduction to Algorithms, Cambridge, Mass.: MIT Press is incorporated herein by reference. Of particular interest in this book is Chapter 10 that deals with selection algorithms that are used in versions of the Fast Segmenting Algorithm.

[0485]¹¹ Hawkins D M, Merriam D F. Optimal Zonation of Digitized Sequential Data. Mathematical Geology, Vol. 5, No. 4, 1973, pp. 389-394; see page 1. This published paper is incorporated herein by reference to the fullest extent of the law.

[0486]¹² Ibid., page 390, under Notation and Method. Hawkins's notation is equivalent, but he uses “N” in place of “n” and “x” in place of “y”. So his data points are x₁, x₂, . . . , x_(N) rather than the equivalent points y₁, y₂, y₃, . . . , y_(n−2), y_(n−1), y_(n). His notation for the endpoints (or cutpoints or changepoints) of the segments and for W is then slightly different.

[0487]¹³ Ibid., page 391, first paragraph. Since the covering is an optimal covering, it follows that W is a minimum for the covering. Now W = r(1, n₁) + r(n₁+1, n₂) + . . . + r(n_(k−2)+1, n_(k−1)) + r(n_(k−1)+1, n) = [r(1, n₁) + r(n₁+1, n₂) + . . . + r(n_(k−2)+1, n_(k−1))] + r(n_(k−1)+1, n) = W^(Λ) + r(n_(k−1)+1, n). The expression in brackets, which equals W^(Λ), is the overall measure of homogeneity for a k−1 segment covering (or (k−1)-way split) for the data points from 1 to n_(k−1). It follows that this k−1 segment covering is also an optimal covering for the data points from 1 to n_(k−1), i.e. that W^(Λ) is also a minimum; if it were not, substituting a better k−1 segment covering would reduce W, contradicting the minimality of W. (Hawkins uses the term “k segment covering”. An equivalent expression is “k-way split” or “k-segment segmentation”.)
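
The optimality argument in this note can be stated compactly as a recurrence. The following is only a sketch: it assumes F_k(m) denotes the minimum overall measure W over all k-segment coverings of the data points 1 to m, and the index symbols j and m are introduced here for illustration.

```latex
F_1(m) = r(1, m), \qquad
F_k(m) = \min_{k-1 \le j < m} \bigl[\, F_{k-1}(j) + r(j+1, m) \,\bigr],
\quad 2 \le k \le m \le n .
```

Under that assumption, F_k(n) is the minimum W for a k-way split of the full data set, and the minimizing j at m = n plays the role of the last endpoint n_(k−1).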

[0488]¹⁴ Ibid., page 391, first and second paragraphs. The simple search and the development of the algorithm by using the simple search recursively are presented by Hawkins here. Again, Hawkins's notation differs slightly from what is used here and in provisional patent application No. 60/225113.
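
The recursive use of the simple search can be illustrated with a short Python sketch. It is not the Fast Segmenting Algorithm and is not taken from Hawkins's paper; it is a plain dynamic programming version written under stated assumptions: r(i, j) is taken to be the sum of squared deviations of points i to j about their mean, and the names make_r, optimal_split, F and back are hypothetical.

```python
from itertools import accumulate


def make_r(y):
    """Build r(i, j) for 1-based inclusive indices (assumed measure:
    sum of squared deviations of y_i..y_j about their mean)."""
    s1 = [0.0] + list(accumulate(y))                 # prefix sums of y
    s2 = [0.0] + list(accumulate(v * v for v in y))  # prefix sums of y^2

    def r(i, j):
        total = s1[j] - s1[i - 1]
        return (s2[j] - s2[i - 1]) - total * total / (j - i + 1)

    return r


def optimal_split(y, k):
    """Return (W, endpoints) for an optimal k-segment covering of y.

    endpoints lists n_1 < ... < n_(k-1), the last index of each of the
    first k-1 segments (1-based)."""
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    r = make_r(y)
    inf = float("inf")
    # F[q][m]: minimum W for a q-segment covering of points 1..m
    F = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    for m in range(1, n + 1):
        F[1][m] = r(1, m)
    for q in range(2, k + 1):
        for m in range(q, n + 1):
            # simple search over the position j of the last endpoint
            for j in range(q - 1, m):
                cand = F[q - 1][j] + r(j + 1, m)
                if cand < F[q][m]:
                    F[q][m] = cand
                    back[q][m] = j
    # walk the stored endpoints back from (k, n)
    endpoints = []
    m = n
    for q in range(k, 1, -1):
        m = back[q][m]
        endpoints.append(m)
    endpoints.reverse()
    return F[k][n], endpoints


if __name__ == "__main__":
    W, cuts = optimal_split([1, 1, 1, 5, 5, 5, 9, 9, 9], 3)
    print(W, cuts)
```

This version evaluates every candidate endpoint for every cell, so its cost grows roughly as k·n²; it is meant only to show the recursion that the faster techniques in this document are designed to speed up.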

[0489]¹⁵ Ibid., page 391, third and fourth paragraphs. In addition to the algorithm described above, Hawkins presents a slightly modified algorithm that has an additional constraint that each of the segments of an optimal split has a minimum width w. A second procedure (or algorithm) is presented for when k is not known. In the second procedure, k is inferred by generating values of F₁(n), F₂(n), F₃(n), . . . , F_(s)(n) and selecting a value for k (1≦k≦s) for which an improvement in W or F_(k)(n) gained by adding segments is negligibly small.
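
Both variations in this note can be sketched in Python. The sketch is illustrative only and rests on assumptions not stated in the source: r(i, j) is taken to be the sum of squared deviations about the segment mean, and "negligibly small" is interpreted as an improvement no larger than a fraction tol of F₁(n). The names segment_scores, choose_k and tol are hypothetical.

```python
from itertools import accumulate


def segment_scores(y, s, w=1):
    """Return [F_1(n), ..., F_s(n)] with every segment at least w points wide.

    An entry is float('inf') when no covering with that many segments of
    width >= w exists."""
    n = len(y)
    s1 = [0.0] + list(accumulate(y))
    s2 = [0.0] + list(accumulate(v * v for v in y))

    def r(i, j):
        # assumed measure: sum of squared deviations of y_i..y_j about their mean
        total = s1[j] - s1[i - 1]
        return (s2[j] - s2[i - 1]) - total * total / (j - i + 1)

    inf = float("inf")
    prev = [inf] * (n + 1)          # prev[m] holds F_(q-1)(m)
    for m in range(w, n + 1):
        prev[m] = r(1, m)
    scores = [prev[n]]
    for q in range(2, s + 1):
        cur = [inf] * (n + 1)
        for m in range(q * w, n + 1):
            # the first q-1 segments need (q-1)*w points; the last needs w
            cur[m] = min(prev[j] + r(j + 1, m)
                         for j in range((q - 1) * w, m - w + 1))
        scores.append(cur[n])
        prev = cur
    return scores


def choose_k(scores, tol=0.01):
    """Pick the smallest k whose next segment improves F by at most tol * F_1(n)."""
    base = scores[0]
    for k in range(1, len(scores)):
        if scores[k - 1] - scores[k] <= tol * base:
            return k
    return len(scores)


if __name__ == "__main__":
    y = [1, 1, 1, 5, 5, 5, 9, 9, 9]
    scores = segment_scores(y, s=5)
    print(scores, choose_k(scores))
```

The stopping rule in choose_k is one of several reasonable readings of "negligibly small"; a different threshold or a relative improvement criterion could be substituted without changing segment_scores.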

What is claimed is:
 1. A computer-based Segmentation/Recursive Partitioning process or method for generating a nodal tree or equivalent data structure and displaying the nodal tree on a monitor or equivalent device or placing the nodal tree in or on a computer readable medium or transmission signal, wherein the Segmentation/Recursive Partitioning process uses one or more special dynamic programming segmenting algorithms.
 2. A computer-based method as in claim 1, wherein the method is for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of n data objects, n being a positive integer number greater than 100, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising: defining a nodal tree-node segmenting procedure (NT-NS Prcdr), comprising i), ii), iii), iv): i) choosing an unsegmented node that has not been previously segmented; ii) choosing a node segmentation process for the unsegmented node; iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv); applying the NT-NS Prcdr to the root node first; applying the NT-NS Prcdr to zero or more unsegmented nodes of the tree; and displaying the data structure generated as a nodal tree or equivalent thereof on a monitor or equivalent device, or placing the nodal tree in or on a computer readable medium or transmission signal.
 3. A method as in claim 2, wherein each data object is a real-world object, and the response and each descriptor value for each data object being real world data.
 4. A method as in claim 3, wherein the special NT-NS Prcdr is an FSA-special NT-NS Prcdr, so that the FSA-special NT-NS Prcdr effectively uses one or more FSAs.
 5. A method as in claim 4, wherein the FSA-special NT-NS Prcdr uses one or more FSAs.
 6. A method as in claim 5, wherein the method operates by sending information or receiving information or a combination of sending and receiving information over a medium such as the internet.
 7. A method as in claim 4, wherein each of the one or more effectively used FSAs has one or more reference cells, wherein a best score subset or an approximate best score subset is computed for each reference cell.
 8. A method as in claim 5, wherein (1) each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property or wherein (2) each data object is an individual creature or tissue from a creature and each descriptor is a genetic makeup descriptor and the response for each object is a phenotypic characteristic.
 9. A method as in claim 5, wherein each data object is an individual creature or tissue from a creature and each descriptor is (a) a combination of one or more genotypes at one or more polymorphisms or (b) a combination of one or more alleles at one or more polymorphisms or (c) a combination of one or more haplotypes, and the response for each object is a phenotypic characteristic.
 10. A method as in claim 5, wherein each FSA has one or more reference cells, wherein a best score subset and horizontal start subset is computed for each reference cell, and each of the FSAs has one or more directional rectangles of same origin candidate scores, and wherein n is greater than 250.
 11. A method as in claim 5, wherein each FSA has one or more reference cells, wherein a best score subset and a horizontal start subset is computed for each reference cell, wherein the best score subset and the horizontal start subset of each reference cell are the same and are the same size, wherein the size is c scores, wherein each of the FSAs has one or more directional rectangles of same origin candidate scores, wherein each rectangle is compatible with a reference cell pair, wherein each rectangle arises from a horizontal start subset, wherein the width of each rectangle is the same as the size of the horizontal start subset from which each rectangle arose, wherein each FSA has one or more pure soss fast cell blocks, wherein each pure block has the same length and width, wherein the same length and width is the integer c, wherein each pure block is congruent with a reference cell pair, wherein c is less than n, wherein n is greater than 250.
 12. A method as in claim 11, wherein c is the closest integer number to √n, or wherein c is the closest integer number to log n, wherein the log is any base, or wherein c=c₁, c₁ being a positive integer constant.
 13. A method as in claim 10, wherein each FSA uses a least square type measure of homogeneity or inhomogeneity.
 14. A method as in claim 11, wherein each FSA uses a least square type measure of homogeneity.
 15. A method as in claim 14, wherein the measure of homogeneity is the measure of a segment homogeneity that is the sum of squared deviations of the data points within the segment about their mean.
 16. A method as in claim 15, wherein (1) each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property; or wherein (2) each data object is an individual creature or tissue from a creature and each descriptor is (a) a combination of one or more genotypes at one or more polymorphisms or (b) a combination of one or more alleles at one or more polymorphisms or (c) a combination of one or more haplotypes, and the response for each object is a phenotypic characteristic.
 17. A method as in claim 16, wherein each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property and one or more of the descriptors is a geometry-based molecular descriptor.
 18. A method as in claim 17, wherein each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property, and the property is a biological or pharmaceutical property.
 19. A method as in claim 18, wherein c is the closest integer number to √n.
 20. A method as in claim 19, wherein the property is a positive or negative drug response.
 21. A method as in claim 16, wherein each data object is a human being or tissue from a human being, and each descriptor is (a) a combination of one or more genotypes at one or more polymorphisms or (b) a combination of one or more alleles at one or more polymorphisms or (c) a combination of one or more haplotypes, and the response for each object is a phenotypic characteristic.
 22. A method as in claim 21, wherein the response for each object is a drug effect.
 23. A method as in claim 22, wherein c is the closest integer number to √n.
 24. A method as in claim 22, wherein the phenotypic response for each object is a positive or negative drug response.
 25. A method as in claim 24, wherein c is the closest integer number to √n.
 26. A computer readable medium containing a computer software program for controlling a computer-based Segmentation/Recursive Partitioning process or method for generating a nodal tree or equivalent data structure and displaying the nodal tree on a monitor or equivalent device or placing the nodal tree in or on a computer readable medium or transmission signal, wherein the Segmentation/Recursive Partitioning process uses one or more special dynamic programming segmenting algorithms.
 27. A computer readable medium containing a computer software program as in claim 26, wherein the method is for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of n data objects, n being a positive integer number greater than 100, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising: defining a nodal tree-node segmenting procedure (NT-NS Prcdr), comprising i), ii), iii), iv): i) choosing an unsegmented node that has not been previously segmented; ii) choosing a node segmentation process for the unsegmented node; iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv); applying the NT-NS Prcdr to the root node first; applying the NT-NS Prcdr to zero or more unsegmented nodes of the tree; and displaying the data structure generated as a nodal tree or equivalent thereof on a monitor or equivalent device, or placing the nodal tree in or on a computer readable medium or transmission signal.
 28. A computer readable medium containing a computer software program as in claim 27, wherein each data object is a real-world object, and the response and each descriptor value for each data object being real world data.
 29. A computer readable medium containing a computer software program as in claim 28, wherein the special NT-NS Prcdr is an FSA-special NT-NS Prcdr, so that the FSA-special NT-NS Prcdr effectively uses one or more FSAs.
 30. A computer readable medium containing a computer software program as in claim 29, wherein the FSA-special NT-NS Prcdr uses one or more FSAs.
 31. A computer readable medium containing a computer software program as in claim 30, wherein the method operates by sending information or receiving information or a combination of sending and receiving information over a medium such as the internet.
 32. A computer readable medium containing a computer software program as in claim 29, wherein each of the one or more effectively used FSAs has one or more reference cells, wherein a best score subset or an approximate best score subset is computed for each reference cell.
 33. A computer readable medium containing a computer software program as in claim 30, wherein (1) each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property or wherein (2) each data object is an individual creature or tissue from a creature and each descriptor is a genetic makeup descriptor and the response for each object is a phenotypic characteristic.
 34. A computer readable medium containing a computer software program as in claim 30, wherein each data object is an individual creature or tissue from a creature and each descriptor is (a) a combination of one or more genotypes at one or more polymorphisms or (b) a combination of one or more alleles at one or more polymorphisms or (c) a combination of one or more haplotypes, and the response for each object is a phenotypic characteristic.
 35. A computer readable medium containing a computer software program as in claim 30, wherein each FSA has one or more reference cells, wherein a best score subset and horizontal start subset is computed for each reference cell, and each of the FSAs has one or more directional rectangles of same origin candidate scores, and wherein n is greater than 250.
 36. A computer readable medium containing a computer software program as in claim 30, wherein each FSA has one or more reference cells, wherein a best score subset and a horizontal start subset is computed for each reference cell, wherein the best score subset and the horizontal start subset of each reference cell are the same and are the same size, wherein the size is c scores, wherein each of the FSAs has one or more directional rectangles of same origin candidate scores, wherein each rectangle is compatible with a reference cell pair, wherein each rectangle arises from a horizontal start subset, wherein the width of each rectangle is the same as the size of the horizontal start subset from which each rectangle arose, wherein each FSA has one or more pure soss fast cell blocks, wherein each pure block has the same length and width, wherein the same length and width is the integer c, wherein each pure block is congruent with a reference cell pair, wherein c is less than n, wherein n is greater than 250.
 37. A computer readable medium containing a computer software program as in claim 30, wherein c is the closest integer number to √n, or wherein c is the closest integer number to log n, wherein the log is any base, or wherein c=c₁, c₁ being a positive integer constant.
 38. A computer readable medium containing a computer software program as in claim 35, wherein each FSA uses a least square type measure of homogeneity or inhomogeneity.
 39. A computer readable medium containing a computer software program as in claim 36, wherein each FSA uses a least square type measure of homogeneity.
 40. A computer readable medium containing a computer software program as in claim 39, wherein the measure of homogeneity is the measure of a segment homogeneity that is the sum of squared deviations of the data points within the segment about their mean.
 41. A computer readable medium containing a computer software program as in claim 39, wherein (1) each data object is a molecular data object and each descriptor is a molecular descriptor and the response for each object is a molecular property; or wherein (2) each data object is an individual creature or tissue from a creature and each descriptor is (a) a combination of one or more genotypes at one or more polymorphisms or (b) a combination of one or more alleles at one or more polymorphisms or (c) a combination of one or more haplotypes, and the response for each object is a phenotypic characteristic.
 42. A computer readable medium containing a computer software program as in claim 41, wherein c is the closest integer number to √n.
 43. An apparatus, wherein the apparatus includes a computer, wherein the apparatus practices a computer-based Segmentation/Recursive Partitioning process or method for generating a nodal tree or equivalent data structure and displaying the nodal tree on a monitor or equivalent device or placing the nodal tree in or on a computer readable medium or transmission signal, wherein the Segmentation/Recursive Partitioning process uses one or more special dynamic programming segmenting algorithms.
 44. An apparatus as in claim 43, wherein the method is for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of n data objects, n being a positive integer number greater than 100, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising: defining a nodal tree-node segmenting procedure (NT-NS Prcdr), comprising i), ii), iii), iv): i) choosing an unsegmented node that has not been previously segmented; ii) choosing a node segmentation process for the unsegmented node; iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv); applying the NT-NS Prcdr to the root node first; applying the NT-NS Prcdr to zero or more unsegmented nodes of the tree; and displaying the data structure generated as a nodal tree or equivalent thereof on a monitor or equivalent device, or placing the nodal tree in or on a computer readable medium or transmission signal.
 45. An apparatus as in claim 44, wherein each data object is a real-world object, and the response and each descriptor value for each data object being real world data.
 46. An apparatus as in claim 45, wherein the special NT-NS Prcdr is an FSA-special NT-NS Prcdr, so that the FSA-special NT-NS Prcdr effectively uses one or more FSAs.
 47. An apparatus as in claim 46, wherein the FSA-special NT-NS Prcdr uses one or more FSAs.
 48. An apparatus as in claim 47, wherein the method operates by sending information or receiving information or a combination of sending and receiving information over a medium such as the internet.
 49. An apparatus as in claim 46, wherein each of the one or more effectively used FSAs has one or more reference cells, wherein a best score subset or an approximate best score subset is computed for each reference cell.
 50. An apparatus as in claim 47, wherein each FSA has one or more reference cells, wherein a best score subset and horizontal start subset is computed for each reference cell, and each of the FSAs has one or more directional rectangles of same origin candidate scores, and wherein n is greater than 250.
 51. An apparatus as in claim 47, wherein each FSA has one or more reference cells, wherein a best score subset and a horizontal start subset is computed for each reference cell, wherein the best score subset and the horizontal start subset of each reference cell are the same and are the same size, wherein the size is c scores, wherein each of the FSAs has one or more directional rectangles of same origin candidate scores, wherein each rectangle is compatible with a reference cell pair, wherein each rectangle arises from a horizontal start subset, wherein the width of each rectangle is the same as the size of the horizontal start subset from which each rectangle arose, wherein each FSA has one or more pure soss fast cell blocks, wherein each pure block has the same length and width, wherein the same length and width is the integer c, wherein each pure block is congruent with a reference cell pair, wherein c is less than n, wherein n is greater than 250.
 52. An apparatus as in claim 51, wherein c is the closest integer number to √n, or wherein c is the closest integer number to log n, wherein the log is any base, or wherein c=c₁, c₁ being a positive integer constant.
 53. An apparatus as in claim 50, wherein each FSA uses a least square type measure of homogeneity or inhomogeneity.
 54. An apparatus as in claim 51, wherein each FSA uses a least square type measure of homogeneity.
 55. An apparatus as in claim 54, wherein the measure of homogeneity is the measure of a segment homogeneity that is the sum of squared deviations of the data points within the segment about their mean.
 56. An apparatus as in claim 55, wherein c is the closest integer number to √n.
 57. A data structure generated by an apparatus, wherein the structure is in or on a computer readable medium or transmission signal, and wherein some data of the structure are functionally interrelated, wherein the apparatus includes a computer, wherein the apparatus practices a computer-based Segmentation/Recursive Partitioning process or method for generating a nodal tree or equivalent data structure and displaying the nodal tree on a monitor or equivalent device or placing the nodal tree in or on a computer readable medium or transmission signal, wherein the Segmentation/Recursive Partitioning process uses one or more special dynamic programming segmenting algorithms.
 58. A data structure as in claim 57, wherein the method is for clarifying a relationship between a response and one or more descriptors by generating a data structure, the response and each descriptor having a value for each data object of a group of n data objects, n being a positive integer number greater than 100, the data structure being a nodal tree or an equivalent thereof, the root of the tree being the group of data objects, comprising: defining a nodal tree-node segmenting procedure (NT-NS Prcdr), comprising i), ii), iii), iv): i) choosing an unsegmented node that has not been previously segmented; ii) choosing a node segmentation process for the unsegmented node; iii) segmenting the unsegmented node into two or more subgroups using the node segmentation process chosen for the unsegmented node in ii); and iv) making the unsegmented node a segmented tree parent node and making each of one or more of the subgroups of iii) an unsegmented tree daughter node of the segmented tree parent node of iv); applying the NT-NS Prcdr to the root node first; applying the NT-NS Prcdr to zero or more unsegmented nodes of the tree; and displaying the data structure generated as a nodal tree or equivalent thereof on a monitor or equivalent device, or placing the nodal tree in or on a computer readable medium or transmission signal.
 59. A data structure as in claim 58, wherein each data object is a real-world object, and the response and each descriptor value for each data object being real world data.
 60. A data structure as in claim 58, wherein the special NT-NS Prcdr is an FSA-special NT-NS Prcdr, so that the FSA-special NT-NS Prcdr effectively uses one or more FSAs.
 61. A data structure as in claim 60, wherein the FSA-special NT-NS Prcdr uses one or more FSAs.
 62. A data structure as in claim 61, wherein the method operates by sending information or receiving information or a combination of sending and receiving information over a medium such as the internet.
 63. A method as in claim 4, the method further comprising: collecting one or more descriptor values or one or more property values of each of one or more of the real-world objects by physical measurement or observation.
 64. A method as in claim 5, the method further comprising: collecting one or more descriptor values or one or more property values of each of one or more of the real-world objects by physical measurement or observation.
 65. A method as in claim 2, wherein each of over half of the data objects is a real-world data object, and the response and each descriptor value for each real-world data object is real world data.
 66. A computer readable medium containing a computer software program as in claim 29, wherein the computer readable medium is a transmission signal.
 67. An apparatus such as in claim 46, wherein the computer comprises a keyboard, a display device, a pointing device, a RAM, a ROM, a CPU and a storage device such as a hard drive.
 68. An apparatus such as in claim 47, wherein the computer comprises a keyboard, a display device, a pointing device, a RAM, a ROM, a CPU and a storage device such as a hard drive.